Abstract

We report the first formal verification of a reintegration protocol for a 
safety-critical, fault-tolerant, real-time distributed embedded system. A 
reintegration protocol increases system survivability by allowing a node 
that has suffered a fault to regain state consistent with the operational 
nodes. The protocol is verified in the Symbolic Analysis Laboratory 
(SAL), where bounded model checking and decision procedures are used 
to verify infinite-state systems by k-induction. The protocol and its
environment are modeled as synchronizing timeout automata. Because
k-induction is exponential with respect to k, we optimize the formal model
to reduce the size of k. Also, the reintegrator’s event-triggered behavior is
conservatively modeled as time-triggered behavior to further reduce the
size of k and to make it invariant to the number of nodes modeled. A
corollary is that a clique avoidance property is satisfied. 
1 Introduction


Digital control (i.e., “x-by-wire”) systems are being designed for use in 
safety-critical environments such as automobiles, commercial aircraft, and 
piloted space vehicles. In a single vehicle, many systems require
reliable real-time intercommunication. Highly reliable fault-tolerant
virtual busses are being designed for this purpose [Rus01]. 1 Some notable
examples of such busses include TTTech’s Time-Triggered Architecture
(TTA) [Kop94], Honeywell’s SAFEbus [HD92], FlexRay (being developed
by an automotive consortium) [LH02], and NASA Langley Research
Center’s SPIDER [MMT02, MGPM04, NAS04].

These busses are implemented as distributed systems to increase their 
fault-tolerance. A node in the distributed system may suffer a transient 
fault causing it to lose its volatile state but suffer no permanent damage. 
Although such a node may be fault-free, its state no longer is coordinated 
with that of the operational clique, the set of fault-free nodes with co- 
ordinating states allowing them to provide the requested services of the 
system. 2 Nodes in the operational clique are called operational nodes. 

If too many nodes become uncoordinated with the operational clique, 
the system degrades and becomes more susceptible to new faults. Too 
many simultaneous faults will lead the system to violate its maximum 
fault assumption (MFA), the maximum kind and number of faults the 
system is designed to withstand yet maintain correct operation. If the 
MFA is violated, no guarantees can be made about the system’s behavior. 

The extremely high reliability requirements for these busses coupled 
with the potential for a high number of transient faults in the environ- 
ments in which they may operate have led to the development of reinte- 
gration mechanisms for these systems. For a transiently-faulty node to 
regain correct state, it may execute a reintegration protocol. In a synchro- 
nized fault-tolerant distributed system, the reintegrating node (called the 
reintegrator) executes the protocol to resynchronize its local clock with 
those of the nodes in the operational clique. As well, it may need to 
regain diagnostic data consistent with the operational clique. A node’s 
diagnostic data are its view of which other nodes are faulty (messages 
from faulty nodes should be ignored). Other state may also be regained 
via the protocol; for example, if the system supports dynamic scheduling,
the current schedule needs to be obtained, too.

To the author’s knowledge, we present the first formal verification of 
a reintegration protocol. In [Rus02], Rushby describes the formal verifi- 
cation of TTA, one of the most mature and fully formally-verified busses 
in development, and therein states that the formal analysis of reintegra- 
tion remains important future work. The work presented here should be 
extensible to other fault-tolerant systems that employ reintegration
protocols, especially given that our verification is architecture-independent
(see Sec. 7).

1 Rushby notes that the term ‘bus’ “understates their complexity, sophistication, and
criticality” [Rus01].

2 The operational clique and the set of non-faulty nodes are not necessarily equivalent: for
example, a reintegrator is a non-faulty node not in the operational clique. This distinction
can be subtle and in fact, a misunderstanding of it was partially responsible for a subtle error
in the previous design of another SPIDER protocol [PMT04].

This work extends results in using bounded model checking and
decision procedures to verify infinite-state systems using k-induction (also
known as temporal induction), a generalization of induction over
transition systems. In particular, we build on Dutertre and Sorea’s work in
which they develop a timeout automata model for specifying and
verifying real-time systems [DS04b, DS04a]. The formalism is particularly
well-suited for k-induction proofs over transition systems, and it does not
require specialized algorithms for model checking (as opposed to, e.g.,
timed automata [Alu99]).

Our focus is to make the k-induction technique feasible for large
systems; this amounts to reducing the size of k required for k-induction
verification. We follow two approaches to do so. First, we extend the timeout
automata model so that real-time systems containing both synchronous
and asynchronous components can be described more easily. We call these
Synchronizing Timeout Automata (STA). Introducing synchrony often
reduces the size of k required. Second, we optimize the semantics so that
the constructed transition system includes no time transitions; all
transitions are ones in which discrete state is updated. This can greatly reduce
the depth at which k-induction must be applied to prove a given safety
property. We also describe a means to conservatively model
event-triggered physical behavior as time-triggered behavior. Such a
model may contain significantly fewer state transitions than the physical
system contains. Both kinds of optimizations are necessary to complete
the verification of the reintegration protocol.

Organization In Sect. 2, we describe the SAL toolset and the
k-induction proof technique in SAL. In Sect. 3, we describe Dutertre and Sorea’s
timeout automata. We then define synchronizing timeout automata (STA) 
and we present a STA model of the train-gate-controller, a canonical ex- 
ample of a real-time system. We describe the SPIDER Reintegration 
Protocol in Sect. 4, and in Sect. 5, we describe how the protocol is mod- 
eled as a timeout automaton in SAL. Additionally, we describe how we 
modeled event-triggered behavior as time-triggered behavior to ease the 
verification. In Sect. 6, we describe the verification of the protocol, and 
concluding remarks are given in Sect. 7. 


2 Symbolic Analysis Laboratory (SAL)
and k-induction

This protocol was specified and verified in the Symbolic Analysis
Laboratory (SAL) [BGL+00, SRI04], developed by SRI International. SAL
is a verification environment that includes explicit-state, symbolic, and 
bounded model checkers, an interactive simulator, as well as other tools. A 
single language serves as the input to the verification tools. The language 
includes a type system, quantification over finite domains, uninterpreted 
constants and functions, and synchronous and asynchronous composition 
operators. SAL may be downloaded at [SRI04], free of charge, for
non-commercial use.

The verification tools used here were SAL’s bounded model checker in
conjunction with the Integrated Canonizer and Solver (ICS), a decision
procedure for a quantifier-free, first-order theory of equality, the terms of
which include uninterpreted functions, linear arithmetic, products, arrays,
fixed-sized vectors, etc. [dMOR+04]. Although ICS is the default decision
procedure in SAL, other decision procedures such as UCLID, CVC, and
SVC may be used [dMOR+04].

Together, these tools can be used to prove state invariants hold in
infinite transition systems. The invariants do not need to be strictly
inductive; SAL supports k-induction, also known as temporal induction,
a generalization of the ordinary induction principle (over transition
systems) [SSS00, ES03]. Let $(S, S^0, \to)$ be an unlabeled transition system
where $S$ is a set of states, $S^0 \subseteq S$ is a nonempty set of initial states, and
$\to \subseteq S \times S$ is a transition relation. A 0-trajectory (over the transition
system) is a state $s$. For $k \in \mathbb{N}^{0<}$, a k-trajectory is a sequence of states,
$s_0, s_1, \ldots, s_k$, such that for $0 \le i < k$, $s_i \to s_{i+1}$. Then the k-induction
principle is as follows.

Definition 1 (k-Induction Principle). Let $k \in \mathbb{N}^{0<}$, and let $P : S \to$
bool be some predicate defined over states of $S$.

• Base Case: For all $0 \le j < k$, show that for each j-trajectory
$s_0, s_1, \ldots, s_j$ such that $s_0 \in S^0$, $P(s_j)$ holds.

• Induction Step: Show that if $s_0, s_1, \ldots, s_{k-1}$ is a (k−1)-trajectory,
and for all $0 \le j < k$, $P(s_j)$ holds, then for all $s_k \in S$ such that
$s_{k-1} \to s_k$, $P(s_k)$ holds.

Property $P$ is a k-inductive property with respect to $(S, S^0, \to)$ if there
exists some $k \in \mathbb{N}^{0<}$ such that $P$ satisfies the k-induction principle. The
ordinary induction principle is the special case when k = 1. The benefit
of k-induction is that as k increases, weaker invariants may be provable.
The problem of discovering sufficiently strong inductive invariants can be
exceedingly difficult, and more often than not, a desired invariant is too
weak to be proved with the ordinary induction principle. Discovering
sufficiently strong inductive invariants is an active area of research [HS96,
Rus00].
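The principle can be illustrated on a small finite system. The following Python sketch is not SAL and is not how SAL's bounded model checker works internally; it enumerates trajectories by brute force, and the five-state system, the names `trajectories` and `is_k_inductive`, and the property `safe` are all hypothetical choices made for illustration.

```python
def trajectories(starts, succ, length):
    """Return all state sequences with `length` transitions that begin in `starts`."""
    paths = [(s,) for s in starts]
    for _ in range(length):
        paths = [p + (t,) for p in paths for t in succ[p[-1]]]
    return paths


def is_k_inductive(states, init, succ, prop, k):
    """Brute-force check of the k-induction principle on a finite system."""
    # Base case: P holds at the end of every j-trajectory from an initial state, j < k.
    for j in range(k):
        if not all(prop(p[-1]) for p in trajectories(init, succ, j)):
            return False
    # Induction step: a (k-1)-trajectory of P-states can only step into a P-state.
    for p in trajectories(states, succ, k - 1):
        if all(prop(s) for s in p) and not all(prop(t) for t in succ[p[-1]]):
            return False
    return True


# Hypothetical system: states 0 and 1 are reachable; the chain 2 -> 3 -> 4 is not.
succ = {0: [1], 1: [0], 2: [3], 3: [4], 4: [4]}
states, init = list(succ), [0]
safe = lambda s: s != 4  # the property P

results = {k: is_k_inductive(states, init, succ, safe, k) for k in (1, 2, 3)}
```

In this toy system the property fails for k = 1 and k = 2, because the unreachable states 2 and 3 satisfy P yet lead to state 4, and it succeeds for k = 3, because no trajectory of three P-states ends in state 3. The number of enumerated trajectories grows with k, which is one way to see why keeping k small matters.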

Furthermore, SAL allows state invariants to be used as lemmas to
support k-induction. An invariant has the effect of strengthening the
antecedents in the base case and induction step so that only states satisfying
the invariant are considered. That is, if $Q$ is an invariant over states, then
the principle is as stated in Def. 2.

Definition 2 (k-Induction Principle with Inductive Invariants).

• Base Case: For all $0 \le j < k$, show that for each j-trajectory
$s_0, s_1, \ldots, s_j$ such that $s_0 \in S^0$ holds and for each $0 \le i \le j$, $Q(s_i)$
holds, $P(s_j)$ holds.

• Induction Step: Show that if $s_0, s_1, \ldots, s_{k-1}$ is a (k−1)-trajectory,
and for all $0 \le j < k$, $Q(s_j)$ and $P(s_j)$ hold, then for all $s_k \in S$ such
that $s_{k-1} \to s_k$ and $Q(s_k)$, $P(s_k)$ holds.


Multiple invariants may be simultaneously used by taking their con- 
junction to be the invariant. 
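The effect of a lemma can be sketched in the same brute-force style. The Python below is an illustration only, under the assumption that the lemma $Q$ has already been established as an invariant; the system and the names `is_k_inductive_with_lemma` and `lemma` are hypothetical.

```python
def trajectories(starts, succ, length):
    """Return all state sequences with `length` transitions that begin in `starts`."""
    paths = [(s,) for s in starts]
    for _ in range(length):
        paths = [p + (t,) for p in paths for t in succ[p[-1]]]
    return paths


def is_k_inductive_with_lemma(states, init, succ, prop, lemma, k):
    """Brute-force k-induction where only states satisfying the lemma Q are considered."""
    # Base case: restrict attention to trajectories whose states all satisfy Q.
    for j in range(k):
        for p in trajectories(init, succ, j):
            if all(lemma(s) for s in p) and not prop(p[-1]):
                return False
    # Induction step: the premises and the successor state are restricted to Q-states.
    for p in trajectories(states, succ, k - 1):
        if all(lemma(s) and prop(s) for s in p):
            if not all(prop(t) for t in succ[p[-1]] if lemma(t)):
                return False
    return True


# Hypothetical system: states 0 and 1 are reachable; the chain 2 -> 3 -> 4 is not.
succ = {0: [1], 1: [0], 2: [3], 3: [4], 4: [4]}
states, init = list(succ), [0]
safe = lambda s: s != 4   # the property P
lemma = lambda s: s != 3  # Q, assumed to be an already-proved invariant

with_lemma = is_k_inductive_with_lemma(states, init, succ, safe, lemma, 1)
without_lemma = is_k_inductive_with_lemma(states, init, succ, safe, lambda s: True, 1)
```

With the lemma, the property is provable at k = 1; with the trivial lemma it is not, since the unreachable state 3 is no longer excluded from the induction step.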

Other systems such as NuSMV [CCO+04] implement k-induction (its
implementors call it “een-sorensson”) via bounded model checking.
However, the author knows of no other tools that integrate k-induction with
decision procedures to verify infinite-state systems.


3 Timed Systems in SAL 

Dutertre and Sorea explore the verification of infinite-state timed
transition systems via k-induction in SAL [DS04b]. They first consider
specifying these systems as timed automata [Alu99], one of the most
prominent formalisms for the specification and verification of real-time systems.
Although they demonstrate that it is possible to specify timed automata
in SAL via a shallow embedding (i.e., a
timed automaton is manually transcribed into a semantically-equivalent
SAL specification), it proves to be unwieldy [DS04b]. The SAL language
is rich, but it is a general-purpose tool for specifying composed state
machines; neither the syntax nor the semantics of the language match those
of timed automata particularly well. In particular, the clock variables in
timed automata may be updated in arbitrarily small increments leading
to infinite trajectories in which the discrete state idles. This makes proof
by k-induction difficult and sometimes impossible.

This motivated their development of another theoretical model in
which to represent timed transition systems that is more amenable to
general-purpose verification environments in which composed state
machines can be specified, particularly for verification by k-induction. A
timeout automaton is another means by which to specify timed
transition systems. 3 Timeout automata were motivated by the model of system
execution used in discrete-event simulation [BI84].

In [DS04b, DS04a], Dutertre and Sorea provide the semantics of a
timeout automaton in terms of a transition system. Fix a set of state
variables $V$. An additional variable $t$, ranging over the nonnegative reals,
records the current time. There is also a set of timeout variables $T$, ranging
over the nonnegative reals. A state in the transition system is a function mapping
each variable to some value from the set over which it ranges. For any
initial state $p$, $p(t) \le p(x)$ for all $x \in T$. As in the definition of
timed automata, there are two kinds of transitions: time progress
transitions and discrete transitions. A time
progress transition is enabled in a state if and only if for all $x \in T$,
$p(t) < p(x)$. In this case, the state changes by updating $p(t)$ to the
least-valued timeout (there may be multiple timeouts that are least-valued).
Discrete transitions are enabled in a state if and only if there exists some
timeout $x$ such that $p(t) = p(x)$. Furthermore, the following conditions
must hold for a discrete transition from state $p$ to $p'$:

• $p'(t) = p(t)$;

• for all $x \in T$, $p'(x) \ge p(t)$;

• there exists $y \in T$ such that $p(y) = p(t)$ and $p'(y) > p(t)$.

The third condition prevents infinite zero-delay state transitions. If
multiple discrete transitions are enabled in a state, exactly one is
nondeterministically applied. Note, too, that discrete transitions are instantaneous
(i.e., the current time is not updated during their application).

3 We use ‘automata’ to refer to syntax, distinct from the semantics for automata.
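The two kinds of transitions can be sketched in a few lines of Python. This is an illustration only, not the SAL encoding: timeouts are held in a dictionary, the names `advance_clock` and `valid_discrete_update` and the timeout variables `x` and `y` are hypothetical, and the second condition is taken to be non-strict (no timeout is set in the past), which is an assumption about the intended reading.

```python
def time_progress_enabled(t, timeouts):
    """Time may advance only if every timeout lies strictly in the future."""
    return all(t < x for x in timeouts.values())


def advance_clock(t, timeouts):
    """Time progress transition: jump directly to the least-valued timeout."""
    if not time_progress_enabled(t, timeouts):
        raise ValueError("a timeout has expired; a discrete transition must fire")
    return min(timeouts.values())


def discrete_enabled(t, timeouts):
    """A discrete transition is enabled when some timeout equals the current time."""
    return any(x == t for x in timeouts.values())


def valid_discrete_update(t, old, new):
    """Check the conditions on the timeouts for a discrete transition at time t."""
    none_in_past = all(x >= t for x in new.values())
    some_expired_moves_forward = any(old[y] == t and new[y] > t for y in old)
    return none_in_past and some_expired_moves_forward


timeouts = {"x": 3, "y": 5}     # hypothetical timeout variables
t = advance_clock(0, timeouts)  # the clock jumps from 0 to the least timeout
```

Because the clock jumps straight to the least-valued timeout, there are no arbitrarily small time steps in which the discrete state idles. An update that leaves the expired timeout at the current time is rejected by `valid_discrete_update`, which corresponds to the third condition.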

An important distinction between timeout automata and formalisms 
like timed automata is that in a timed automaton, clocks measure how 
much time has elapsed since their last reset, whereas timeouts measure 
how much time will elapse until the next state transition. Very loosely 
speaking, timeout automata and timed automata are dual with respect to 
their perspective of time. 

Although timeout automata were initially motivated by the desire to
specify and verify real-time systems in SAL for k-induction verification,
they are of significant interest in their own right. Results obtained by
Dutertre and Sorea in specifying and verifying the startup algorithm for
the Time-Triggered Architecture (TTA) using timeout automata in SAL
suggest timeout automata specifications in SAL may be superior to those
attainable in Uppaal, a timed automata model checker [LPY97], although
a direct comparison is not made [DS04b]. 4 By some measures, a timeout
automaton has a simpler semantics than does a timed automaton and
may allow for a convenient simple embedding of real-time system models
in other general-purpose model checkers. In any event, timeout automata
are another formalism by which to specify real-time systems that has
proved useful in the verification of non-trivial protocols via k-induction.
Theoretical comparisons between timeout automata and other real-time
formalisms are important future work but not our present goal.

3.1 Synchronizing Timeout Automata (STA) 

The following definitions build upon the timeout automata semantics de- 
veloped in [DS04b, DS04a] and described above. We define a syntax, 
semantics and composition as follows. We call this the Synchronizing 
Timeout Automata (STA) model. The STA model is motivated by the de- 
sire to provide a succinct specification and efficient semantics for systems 
that synchronize both with respect to events (e.g., message passing) and 
with respect to time (e.g., time-triggered [Rus99, Kop97] behavior). For 
example, the train-gate-controller (TGC) is a canonical example of such
a real-time system [Alu99]. In [DS04b], it is modeled as the asynchronous 
composition of timeout automata in which synchronous communication is 
modeled by the sequential application of edges with the same label. We 
arguably provide a timeout automata model with a semantics that more 
closely resembles its intended semantics. 

As noted, timeout automata were motivated by Dutertre and Sorea’s
desire to specify timed systems amenable to k-induction, particularly in
SAL. We found their development of timeout automata to be a
breakthrough in this respect. Nevertheless, k-induction proofs have a
complexity that is exponential with respect to k (by solving the equivalent boolean
satisfaction problem). The initial timeout automata models of the
SPIDER Reintegration Protocol required k-induction at infeasible depths.
Due to the size of the model, k-induction proofs for k > 4 were often
infeasible for even a small number of modeled nodes. By allowing both
synchronous and asynchronous composition, we can markedly reduce the
depth required for proofs by k-induction since in a synchronous
composition, multiple edges may be applied simultaneously.

4 A meaningful comparison of the formalisms might be difficult given that different tools
and formalisms are also used, and indeed, k-induction is not implemented in many other
systems.

We use the train-gate-controller (TGC) to illustrate this. In [DS04b],
a simple safety property is proved using k-induction for k = 14 when the
TGC is modeled with the (asynchronous) timeout automata semantics
described in Sec. 3. In Sec. 3.2, we prove the same property with k = 9
in a synchronous timeout automata model described below. When the
optimization in Sec. 3.3 is also applied, the property is provable for
k = 5. This allows significantly more complex systems to be verified via
k-induction without having to strengthen the invariant, and it was necessary
to complete the verification of the reintegration protocol.

Syntax The definition of a synchronizing timeout automaton (STA) is 
as follows: 

Definition 3 (STA Syntax). A synchronizing timeout automaton STA
is a tuple $(V, M, I, E)$, where

• $V$ is a nonempty finite set of state variables. $\mathit{fV}$ is the set of all
possible total assignment functions that assign values (from the
respective sets over which the variables range) to these variables. These
functions are called states. We use variables $f$, $g$, $h$, and $i$ to denote
states.

• $M \subseteq 2^V$ is a nonempty set of subsets of state variables that cover
$V$ (i.e., for each $v \in V$, there exists $m \in M$ such that $v \in m$).
For $m \in M$, the set $\mathit{fV}_m = \{ f \upharpoonright m \mid f \in \mathit{fV} \}$ is the set of states
restricted to variables in $m$. An element $f_m \in \mathit{fV}_m$ is the m timeout
component or m-component of state $f$.

• $I$ is a set of initial states and associated timeouts. A timeout is
associated with each $m \in M$. A timeout ranges over the set of
nonnegative reals, denoted by $\mathbb{R}^{0\le}$. The set of all possible timeout
vectors is $\mathit{TO} = \{ \alpha \mid \alpha : M \to \mathbb{R}^{0\le} \}$ (we use lowercase Greek letters
to denote timeout vector variables). The relation $I \subseteq \mathit{fV} \times \mathit{TO}$
relates initial states to initial timeout vectors.

• $E$ is a set of edges for each timeout component. For $m \in M$, let
$\mathit{TO}_m = \{ \alpha \upharpoonright m \mid \alpha : M \to \mathbb{R}^{0\le} \}$ be the set of possible timeout
vectors restricted to subsets of $m$ (we use subscripted lowercase Greek
letters to denote restricted timeout vector variables). An element
$\alpha_m \in \mathit{TO}_m$ is an m-timeout vector.

Then for each $m \in M$, $E_m \subseteq \mathit{fV}_m \times \mathit{TO}_m \times \mathit{fV}_m \times \mathit{TO}_m$ is an
edge relation. $E_m$ relates a current m-component and m-timeout
vector to an updated m-component and m-timeout vector. An edge
$(f_m, \alpha_m, g_m, \beta_m)$ is called an m-edge or an edge for m.


Remark 1 (Timeouts and Timeout Components) . A timeout component 
represents a portion of the state that updates synchronously. The notion 
of a timeout component is orthogonal to that of a single state machine
in a composition. For example, if one automaton sends another a time- 
triggered message and the automata synchronize on that message, the 
state variables of the two automata are in the timeout component trigger- 
ing the message (see Sec. 3.2, for an example). A synchronous distributed 
system [Lyn96] can be represented by letting M be a singleton set con- 
taining V. 

In general, if an edge updates a timeout nondeterministically, it is 
updated to some value over a continuous interval on the nonnegative reals. 

Semantics We require that a STA satisfy the following property to 
provide a semantics. It ensures that if edges for different timeout com- 
ponents are simultaneously applied, they agree on how to update shared 
variables. 

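Definition 4 (Synchronous Update Property). For all $m, n \in M$
where $m \ne n$ and $m \cap n \ne \emptyset$, if there exist edges $E_m(f_m, \alpha_m, g_m, \beta_m)$
and $E_n(f_n, \alpha_n, h_n, \gamma_n)$, then $g_m \upharpoonright n = h_n \upharpoonright m$, and $\beta_m \upharpoonright n = \gamma_n \upharpoonright m$.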

The semantics are then as follows. 

Definition 5 (STA Semantics). Let STA = $(V, M, I, E)$ be a timeout
automaton that satisfies the Synchronous Update Property. Its semantics
is an unlabeled transition system $S_{STA}$. A state of $S_{STA}$ 5 is a tuple $(f, \alpha, t)$
consisting of a state $f \in \mathit{fV}$, a timeout vector $\alpha \in \mathit{TO}$, and a clock,
$t \in \mathbb{R}^{0\le}$. The tuple $(f, \alpha, t)$ is an initial state of $S_{STA}$ if and only if
$(f, \alpha) \in I$ and $t = 0$.

Let $(f, \alpha, t)$ and $(g, \beta, t')$ be states. There is a time progress transition
$(f, \alpha, t) \to (g, \beta, t')$ if and only if $t < \min(\alpha)$, $t' = \min(\beta)$, $g = f$, and $\beta = \alpha$.

To specify discrete transitions, the following definitions are of
assistance. In the state $(f, \alpha, t)$, $E_m(h_m, \gamma_m, i_m, \delta_m)$ is an enabled edge if
and only if $h_m = f \upharpoonright m$, $\gamma_m = \alpha \upharpoonright m$, and $\alpha(m) = t$. An m-component
is an enabled timeout component in $(f, \alpha, t)$ if and only if there exists
an m-edge enabled in that state. Furthermore, if $(f, \alpha, t)$ and $(g, \beta, t')$
are states, then $E_m(h_m, \gamma_m, i_m, \delta_m)$ is applied in the discrete
transition $(f, \alpha, t) \to (g, \beta, t')$ if and only if it is an enabled edge in $(f, \alpha, t)$,
$i_m = g \upharpoonright m$, and $\delta_m = \beta \upharpoonright m$.

Then the discrete transition $(f, \alpha, t) \to (g, \beta, t')$ holds if and only if
for every $m \in M$ such that the m-component is enabled in $(f, \alpha, t)$,
there exists some m-edge that is applied.

Remark 2 (Minimum Timeouts). Note that an edge $E_m(f_m, \alpha_m, g_m, \beta_m)$
will never be applied if $\alpha_m(m) \ne \min(\alpha_m)$.
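A synchronous discrete step can be sketched as follows. This Python fragment is a simplified illustration, not the SAL model: timeout vectors are dictionaries keyed by component name, each enabled component contributes one dictionary of variable updates, and the helper names `enabled_components` and `merge_updates` as well as the sample values are hypothetical. The merge rejects updates that disagree on a shared variable, which is the situation the Synchronous Update Property rules out.

```python
def enabled_components(alpha, t):
    """Components whose timeout equals the current time t."""
    return sorted(m for m, x in alpha.items() if x == t)


def merge_updates(updates):
    """Combine per-component variable updates into one synchronous update."""
    merged = {}
    for upd in updates:
        for var, val in upd.items():
            # setdefault returns the earlier value if another component already set var.
            if merged.setdefault(var, val) != val:
                raise ValueError(f"components disagree on shared variable {var}")
    return merged


# Hypothetical snapshot: two components time out together at t = 4.
alpha = {"m_t": 4, "m_c": 4, "m_g": 6}
firing = enabled_components(alpha, 4)

# One update per enabled component; both mention the shared variable s_c.
step = merge_updates([{"s_t": "t1", "s_c": "c1"}, {"s_c": "c1", "msg_c": "null"}])
```

Only components holding the least timeout can fire, in line with Remark 2, and all of them fire in the same transition, which is what allows a synchronous model to reach a given state in fewer steps.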

Remark 3 (Nonzeroness and NonZenoness). Additional properties are
required for executability. The Nonzero Property ensures that timeouts are
never updated to values in the past, and at least one timeout is updated to
some time in the future. This prevents infinite discrete state transitions
with no time progress. For all edges $E_m(f_m, \alpha_m, g_m, \beta_m)$, $\min(\beta_m) \ge
\min(\alpha_m)$, and there exists $n \in M$ such that $\beta_m(n) > \min(\alpha_m)$. Note
that this does not prevent an edge from updating a timeout to some time
sooner than its current value.

5 Context distinguishes whether we speak of the states in $\mathit{fV}$ or the constructed states of
the transition system.

The nonZeno Property [AL94] ensures that an infinite number of tran- 
sitions are not enabled within a finite interval of time. This property 
should also be satisfied if a specification is to be implementable. 

Composition Two STA are composed by taking the union of their
state variables, timeout components, and edges. The set of initial states of the
composition is defined as the set of states satisfying the initial conditions
of each automaton.

Definition 6 (Composition). Let $STA^1 = (V^1, M^1, I^1, E^1)$ and $STA^2 =
(V^2, M^2, I^2, E^2)$. Their composition, denoted $STA^1 \parallel STA^2$, is the
STA $(V^1 \cup V^2, M^1 \cup M^2, I, E^1 \cup E^2)$, where $(f, \alpha) \in I$ if and only if
$(f \upharpoonright V^1, \alpha \upharpoonright M^1) \in I^1$, and $(f \upharpoonright V^2, \alpha \upharpoonright M^2) \in I^2$.
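Composition as a union can be sketched directly. In the Python below, an STA is represented as a tuple of a variable set, a set of timeout components (frozensets of variables), an initial-state predicate, and a set of edge labels; this representation, the function name `compose`, and the two toy automata are assumptions made for illustration and carry none of the detail of real edges.

```python
def compose(sta1, sta2):
    """Compose two STA by taking unions; initial states must satisfy both."""
    v1, m1, init1, e1 = sta1
    v2, m2, init2, e2 = sta2

    def init(f, alpha):
        # Restrict the state and the timeout vector to each automaton before checking.
        ok1 = init1({v: f[v] for v in v1}, {m: alpha[m] for m in m1})
        ok2 = init2({v: f[v] for v in v2}, {m: alpha[m] for m in m2})
        return ok1 and ok2

    return (v1 | v2, m1 | m2, init, e1 | e2)


# Two hypothetical single-variable automata.
m_a, m_b = frozenset({"s_a"}), frozenset({"s_b"})
sta_a = ({"s_a"}, {m_a}, lambda f, a: f["s_a"] == "a0", {"edge_a"})
sta_b = ({"s_b"}, {m_b}, lambda f, a: f["s_b"] == "b0", {"edge_b"})

V, M, I, E = compose(sta_a, sta_b)
```

Here `I` accepts a state only if its restriction to each automaton's variables and timeout components is an initial state of that automaton.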

Remark 4 (Compositional Specifications). The specification of a STA is 
somewhat orthogonal to the notion of composed state machines. Because 
timeout components include state variables from communicating state 
machines, in practice, state machines are not specified separately as STAs 
and then composed. 

3.2 STA Model of the Train-Gate-Controller 

The train-gate-controller (TGC) is a canonical example of a real-time 
system. It models the interaction of a train, a gate, and a gate controller 
at a railroad crossing. (For simplicity, assume there is one train on a 
circular track that may repeatedly approach the crossing.) Initially, the 
train is out of the crossing, and the gate is up. The train signals its 
approach to the controller, and after a delay of exactly one unit of time, 
the controller signals the gate to lower. Once the gate has been signaled, 
it takes no more than 1 unit of time to lower. It takes more than 2 and 
no more than 5 units of time from the time the train signals its approach 
until it enters the crossing. Furthermore, it must exit the crossing within 
5 units of time from when it signals its approach. When the train signals
its exit to the controller, within one unit of time of receiving this signal the
controller signals the gate to raise. The gate takes at least 1 and no more
than 2 units of time from when it is signaled to raise until it is completely 
up. As soon as the train has exited, it may approach the crossing again. 

This behavior is modeled as a timed automaton in Fig. 1, by
composing timed state machines, as presented in [Alu99]. The train, gate, and
controller state machines each begin in states $t_0$, $g_0$, and $c_0$, respectively.
Their clock variables are $x$, $y$, and $z$, respectively, and they are assumed
to be synchronous. Clock constraints at the vertices denote the time by
which the state must be left. Clock constraints at the edges constrain
when the edge is enabled, and clocks may also be reset when a
transition is taken. A transition is nondeterministically taken at some time
satisfying the constraints. Edges are labeled. If edges from distinct state
machines share a label, transitions on these edges must be synchronized.
For example, when the train state machine is in state $t_0$ and the
controller state machine is in state $c_0$, they must transition to states $t_1$ and
$c_1$, respectively, simultaneously. The representation in Fig. 1 is based on
a timed automata model of the TGC. A full description with a formal
syntax and semantics of the TGC modeled as a timed automaton can be
found in [Alu99].

Figure 1: The Train-Gate-Controller

STA Model of the TGC Following Def. 3, the TGC is modeled as 
a STA (V, M, I, E), informally described as follows. 

• V: There are five main state variables. The variable $s_t$ ranges over
the state labels for the train ($t_0$, $t_1$, etc.); variables $s_g$ and $s_c$
similarly range over the labels for the gate and controller, respectively.
The variable $msg_t$ ranges over {approach, exit, null}, the messages
the train sends to the controller (the null message denotes the lack
of a message being sent). Likewise, the variable $msg_c$ ranges over
{lower, raise, null}, the messages the controller sends the gate. (All
of the messages that are not sent between machines are irrelevant in
the STA model.)

• M: The set $M$ contains three elements, $m_t$, $m_c$, and $m_g$. Each of
these sets contains the state-label variables and message variables for
a machine, and if that machine outputs messages to another one,
it contains the state-label variables for that machine, too. Thus,
$m_t = \{s_t, msg_t, s_c\}$, $m_g = \{s_g\}$, and $m_c = \{s_c, msg_c, s_g\}$.

• I: The state-label variables are initially set to $t_0$, $g_0$, and $c_0$,
respectively. The message variables are initially set to null. Initially,
timeouts may have any value, but note that some initial states lead
to deadlock (e.g., if $m_g$ initially has the strictly least-valued
timeout).

• E: For each timeout component, the edges update the state labels
and timeouts in that component according to the constraints
described. Consider, for example, an edge for the $m_t$-component in
which the train and controller synchronize on the approach
message. For such an edge $E_{m_t}(f_{m_t}, \alpha_{m_t}, g_{m_t}, \beta_{m_t})$, $f_{m_t}(s_t) = t_0$ and
$f_{m_t}(s_c) = c_0$ ($msg_t$ may have any value). In the updated state,
$g_{m_t}(s_t) = t_1$, $g_{m_t}(s_c) = c_1$, and $g_{m_t}(msg_t) = approach$. The
updated timeouts are those associated with $m_t$ and $m_c$; they are
nondeterministically updated to satisfy the constraints $\alpha_{m_t}(m_t) + 2 <
\beta_{m_t}(m_t) \le \alpha_{m_t}(m_t) + 5$ and $\beta_{m_t}(m_c) = \alpha_{m_t}(m_c) + 1$, respectively.

Remark 5 (Timeout Vs. Timed Automata). Note that unlike in the timed
automata formalization, clocks are not reset. Timeouts continue to
increase indefinitely, but they are required to satisfy the constraints given
the current time. For example, if $t$ is the current time, upon entering
state $t_1$, the timeout for $m_t$ is nondeterministically updated to some
value greater than $t + 2$ and less than or equal to $t + 5$.
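The contrast with clock resets can be made concrete with a small sketch. The Python below enumerates, on a finite grid, some of the values the $m_t$ timeout may take on entering state $t_1$; the real update ranges over a continuous interval, so the grid, the function name `train_timeout_choices`, and the parameter `step` are simplifying assumptions for illustration.

```python
from fractions import Fraction


def train_timeout_choices(t, step):
    """List grid points in the interval (t + 2, t + 5] for the new m_t timeout."""
    lo, hi = t + 2, t + 5
    choices, x = [], lo + step  # start strictly above t + 2
    while x <= hi:
        choices.append(x)
        x += step
    return choices


# Entering t_1 at absolute time 10: the timeout is an absolute deadline, not a reset clock.
choices = train_timeout_choices(Fraction(10), Fraction(1))

# Entering t_1 again later simply yields later absolute deadlines.
later = train_timeout_choices(Fraction(14), Fraction(3, 2))
```

The same relative constraint produces larger absolute timeout values on each visit, since no variable is ever reset to zero.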

TGC Semantics in SAL The specification of the TGC in SAL mod- 
ifies the model developed by Dutertre and Sorea in [DS04b], and is pre- 
sented in Appendix A. Because SAL is a general-purpose specification and 
verification environment, it does not automatically generate the semantics 
of an STA from its syntax. Therefore, we describe a shallow embedding of 
the STA semantics in the language of SAL. A specialized language is not 
required to describe these semantics. Embedding the semantics provides 
a great deal of flexibility; in particular, it allows the user to optimize the 
semantics; we describe one such possibility in Sec. 3.3. 

The basic building block of a SAL specification is a module containing 
global, input, output, and local variables. In a module, transitions are 
specified by guarded commands in which local and output variables may 
be updated. Modules may be either synchronously or asynchronously 
composed, and they communicate via shared variables. 

To implement the semantics, modules are specified for the train, gate, 
and controller. Each of these contains output variables for their state 
labels and outgoing messages, and if a module receives messages from 
another, those are specified as input variables. Each also has an output 
timeout variable. Additionally, there is a module that specifies the global 
clock, which outputs the current time. The other modules have an in- 
put variable to read the current time. Because edges may simultaneously 
update variables from multiple modules (e.g., in a synchronized transi- 
tion between the train and controller), the train, gate, and controller are 
synchronously composed. In each of the guards for the transitions in the 
train, gate, and controller modules is a condition that the relevant time- 
out is equal to the current time. The composition of the train, gate, and 
controller is asynchronously composed with the clock module. The clock 
module has a single transition that updates the clock when the current 
time is less than all the timeouts, and it is updated to the minimum of 
the timeouts. 
Verification The following is a typical safety property one might wish
to prove about the TGC: if the train is in the railroad crossing ($s_t = t_2$),
then the gate is down ($s_g = g_2$). In the model of the TGC employing the
asynchronous timeout automata semantics described in Sec. 3, the property
can be proved by k-induction when k = 14. With the STA semantics
defined, the property is proved when k = 9.

3.3 Clockless STA Semantics 

We describe an optimization to the STA semantics of Def. 5 that removes 
the global clock from the semantics. Applying this optimization reduces 
the depth at which k-induction must be applied to prove safety 
properties about timeout automata. For example, a basic safety property 
of the TGC model is that whenever the train is in the railroad crossing, 
the gate is down. In the original timeout automaton model, this is 
proved in SAL by k-induction at depth 14 [DS04b]. After applying the 
optimization described here, this depth is reduced to k = 5. Appendix B 
contains a STA model of the TGC after applying the optimization. This 
optimization was essential to complete the verification of the 
reintegration protocol. 

In a timeout automaton, the essential purpose of the clock is to record 
the least-valued timeouts of the automata. That is, the clock is either 
equal to the least-valued timeout(s), or it is equal to the least-valued 
timeout(s) in the next state. However, this information can be obtained 
from the timeouts themselves; the clock variable is unnecessary. Removing 
the clock variable reduces the state space. It also shortens paths: with 
a clock, each time the timeouts are updated so that no timeout is equal 
to the current clock time, a transition is taken in which only the clock 
variable is updated. In the worst case, this can double the value of k 
required to prove a state invariant via k-induction. If the number of 
timeouts is large, then state transitions overshadow clock transitions, 
and the savings are smaller. 
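
The worst-case doubling can be illustrated with a small Python sketch. 
This is a toy model, not the SAL specification: the periodic components, 
the function name count_steps, and its parameters are assumptions made 
for illustration only. It counts the transitions in a run with and 
without an explicit clock variable. 

```python
def count_steps(periods, horizon, with_clock):
    """Count transitions until every timeout exceeds `horizon`.

    Each hypothetical component i fires when its timeout is the least
    timeout and then resets it `periods[i]` time units later. Components
    whose timeouts are simultaneously minimal fire in one synchronous step.
    """
    timeouts = list(periods)      # first timeout of each component
    clock = 0
    steps = 0
    while min(timeouts) <= horizon:
        least = min(timeouts)
        if with_clock and clock < least:
            clock = least         # time transition: only the clock changes
            steps += 1
            continue
        # state transition: all components at the least timeout fire together
        timeouts = [t + p if t == least else t
                    for t, p in zip(timeouts, periods)]
        steps += 1
    return steps
```

In this toy every state transition moves the least timeout past the 
clock, so each one is preceded by a clock transition and the path length 
exactly doubles; in a model where several components time out at 
different instants without the clock lagging, the overhead would be 
smaller. 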

Finally, removing the time transitions simplifies the semantics insofar 
as only one kind of transition need be considered. In most formalisms for 
specifying real-time systems, the semantics include both time and state 
transitions. 

Definition 7 (Clockless STA Semantics). Let STA = $\langle V, M, I, E \rangle$ 
be a timeout automaton that satisfies the Synchronous Update Property. 
Its semantics is an unlabeled transition system $S^{c}_{STA}$. A state of 
$S^{c}_{STA}$ is a pair $\langle f, \alpha \rangle$ consisting of a state 
$f \in F_V$ and a timeout vector $\alpha \in TO$. A state 
$\langle f, \alpha \rangle$ is an initial state of $S^{c}_{STA}$ if and 
only if $\langle f, \alpha \rangle \in I$. 

In the state $\langle f, \alpha \rangle$, the $m$-edge 
$\langle h_m, \gamma_m, i_m, \delta_m \rangle$ is an enabled edge if and 
only if $h_m = f \upharpoonright m$, $\gamma_m = \alpha \upharpoonright m$, 
and $\alpha(m) = \min(\alpha)$. An $m$-component is an enabled timeout 
component in $\langle f, \alpha \rangle$ if and only if there exists an 
$m$-edge enabled in the state. Furthermore, if $\langle f, \alpha \rangle$ 
and $\langle g, \beta \rangle$ are states, the $m$-edge 
$\langle h_m, \gamma_m, i_m, \delta_m \rangle$ is applied in the 
transition $\langle f, \alpha \rangle \to \langle g, \beta \rangle$ if and 
only if it is an enabled edge in $\langle f, \alpha \rangle$, 
$i_m = g \upharpoonright m$, and $\delta_m = \beta \upharpoonright m$. 

Then the transition $\langle f, \alpha \rangle \to \langle g, \beta \rangle$ 
holds if and only if for every $m \in M$ such that $m$ is an enabled 
timeout component in $\langle f, \alpha \rangle$, there exists some 
$m$-edge that is applied in the transition. 

The following proposition asserts that the same state invariants are 
true under both clockless semantics and the semantics in Def. 5. 

Proposition 1 (Clockless Simulation). Fix a STA $\langle V, M, I, E \rangle$. 
Let its semantics from Def. 5 be the transition system $S_{STA}$, and let 
its clockless semantics be the transition system $S^{c}_{STA}$. Let $P$ be 
some predicate defined over the states of $S_{STA}$ that does not take 
the clock variable $t$ as an argument. Then $P$ holds for all reachable 
states in $S_{STA}$ if and only if it holds for all reachable states in 
$S^{c}_{STA}$. 

Proof. By induction over the states of $S_{STA}$ and $S^{c}_{STA}$, the 
state $\langle f, \alpha \rangle$ is reachable in $S^{c}_{STA}$ if and 
only if either $\langle f, \alpha, 0 \rangle$ or 
$\langle f, \alpha, \min(\alpha) \rangle$ is reachable in $S_{STA}$. □ 

Example 1 (TGC with Clockless STA Semantics). Removing the clock is 
straightforward. In SAL, this essentially amounts to removing the module 
that specifies the global clock, as described in Sec. 3.2. The specifications 
of the train, gate and controller must then be modified slightly: rather 
than comparing timeouts against the global clock to determine whether 
an edge is enabled, timeouts are directly compared with one another. The 
full SAL specification is in Appendix B. 
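
As a rough illustration of comparing timeouts directly with one another, 
the following Python sketch computes which components satisfy the timing 
condition of the clockless semantics, namely that a component's timeout 
equals the least timeout. The function name and the dictionary 
representation are assumptions for illustration; the actual SAL guards 
also check the state conditions of each edge. 

```python
def enabled_components(timeouts):
    """Return the components whose timeout equals the least timeout.

    `timeouts` maps a (hypothetical) component name to its timeout value.
    No global clock is consulted: timeouts are compared with one another.
    """
    least = min(timeouts.values())
    return {m for m, t in timeouts.items() if t == least}
```

For example, with timeouts of 5 for the train and 3 for both the gate and 
the controller, only the gate and the controller may take a transition. 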

Remark 6 (k-Induction in Clockless Semantics). By removing the global 
clock, we are able to decrease the depth of k-induction needed to prove 
the safety property described in Sec. 3.2 from k = 14 under the original 
timeout automata semantics, to k = 9 in the STA semantics, to k = 5 in 
the clockless STA 
semantics. The benefit of removing the clock is particularly pronounced in 
the TGC model because there are only three timeouts. Thus, the number 
of transitions in a path devoted to updating the clock is relatively high. 
In a system with many more timeouts, this effect is less pronounced. For 
example, applying these modifications to the timeout automata model of 
the Fischer Mutual Exclusion Protocol described in [DS04b] for a large 
number of processes would yield more modest results. 


4 The Reintegration Protocol 

The protocol described here abstracts the reintegration protocol being 
designed for the latest SPIDER prototype [TPMM05]. The most significant 
abstraction is that we model only the portion of the protocol in which the 
reintegrator resynchronizes its local clock with the clocks of the nodes in 
the clique. We omit that portion of the protocol in which the reintegrator 
regains diagnostic data consistent with the operational clique. This por- 
tion of the protocol is a slight modification of the SPIDER Distributed 
Diagnosis Protocol (the main difference being that the reintegrator simply 
listens but does not broadcast messages as in the full distributed diagno- 
sis protocol) [TPMM05]. The Distributed Diagnosis Protocol has been 
formally verified in PVS [ORSvH95], and the protocol and its verification 
are described in [MGPM04]. 

From a pragmatic standpoint, resynchronization during reintegration 
is the most complex portion of the protocol and stood to benefit the most 
from formal analysis. Once reintegration is achieved, the remainder of 
the protocol can be modeled as being synchronous, substantially easing 
its analysis. 

Other minor simplifications include, for example, not modeling the timers 
that signal massive failure (e.g., where there is no clique with which to 
reintegrate) and trigger the reintegrator to stop executing the 
reintegration protocol and begin executing a reset protocol. The protocol, 
as it is described in the remainder of this section, is fully modeled and 
verified. 

During the reintegration protocol, the reintegrator monitors its 
communication links for echo messages (or simply echos) sent by the other 
nodes. Echos are messages sent by nodes during the SPIDER Clock Syn- 
chronization Protocol, a fault-tolerant protocol in which nodes synchronize 
their local clocks that may have drifted (this protocol and its formal ver- 
ification are also described in [MGPM04]). The clock synchronization 
protocol must be executed periodically by all operational nodes because 
clock drift is inevitable, even in operational nodes. The period beginning 
at the conclusion of one execution of the synchronization protocol lasting 
until its next execution is called a resynchronization frame or simply a 
frame. 

We verify the correctness of the reintegration protocol with respect to 
a single reintegrating node. During the reintegration protocol, the rein- 
tegrator sends no messages. If multiple reintegrators are executing the 
protocol, they receive no messages from each other, assuming they are 
non-faulty. Although a reintegrating node may be non-faulty, it will be 
considered faulty by other nodes simultaneously reintegrating. In partic- 
ular, a reintegrating node will diagnose another as suffering a fail-silent 
fault, since it receives no messages from it. 

The reintegrator is designed to tolerate the full range of faulty 
behaviors, including Byzantine faults [LSP82], manifested as arbitrary behavior 
to respective observers. However, because the reintegration protocol is not 
a distributed protocol (i.e., only a single node executes it), the only fault 
manifestations detectable by the reintegrator are benign faults, detectable 
in point-to-point communication [PMMG04]. 

Finally, note that the ability of the reintegrator to reintegrate success- 
fully with the operational clique depends on the behavior of the nodes in 
the operational clique as well. In particular, the reintegrator executes the 
reintegration protocol after suffering a transient fault and resetting. During 
this period, the operational nodes have likely determined the reintegrator 
to be faulty. So long as the operational nodes believe the reintegrator to 
be faulty, they will ignore it, even if it resynchronizes and regains correct 
local state. Thus, to allow for reintegration, the operational nodes must 
periodically purge their diagnostic data to give nodes a chance to 
reintegrate. In the current SPIDER prototype under development [TPMM05], 
the non-faulty nodes purge their diagnostic data at the end of each resyn- 
chronization frame. This allows a node that has suffered a fault in one 
resynchronization frame to successfully reintegrate in another. 

4.1 System Assumptions 

Before describing the behavior of the protocol, a preliminary 
understanding of the system assumptions is required. These properties are 
stated in terms of accusations made by the reintegrator. The reintegrator 
accuses a node when it believes the communication from the node is 
inappropriate (i.e., the reintegrator does not receive an echo message 
when one is expected or it receives one unexpectedly). 

Figure 2: The Frame Property 

The first property constrains the behavior of the operational nodes 
during each resynchronization frame. It is guaranteed by the correct- 
ness of the clock resynchronization protocol [MGPM04] and the high-level 
scheduling of the protocols. It is illustrated in Fig. 2. 

Definition 8 (Frame Property). Let $\{t_n\}_{n \in \mathbb{N}}$ be a 
sequence of nonnegative reals (denoting real time) assumed to have the 
following properties: for all $n \in \mathbb{N}$, $t_{n+1} > t_n$ and 
$t_{n+1} - t_n = P$. The constant $P$ is called the frame length, and for 
each $n$, the interval $[t_n, t_{n+1})$, closed on the left and open on 
the right, is the $n$th frame. $P$ is constrained as follows: let $l$ be 
the number of faulty nodes not accused by the reintegrator during the 
preliminary diagnosis and frame synchronization modes (to be described 
shortly). Then $P > l\pi + 2\pi$. The constant $\pi \in \mathbb{R}^{>0}$ 
is called the skew constant. The reintegrator receives exactly one echo 
message from each operational node during each open interval 
$(t_n - \pi, t_n)$ (see footnote 6), and no more than one echo message 
in each frame. 
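
The Frame Property can be read as a check on a trace of echo times. The 
following Python sketch is one possible reading, under stated 
assumptions: the function names, the parameter t0 (the start of the first 
frame), and the choice to check the window at the end of each observed 
frame are illustrative and do not come from the SAL model. 

```python
def frame_length_ok(P, pi, unaccused_faulty):
    """Check the constraint P > l*pi + 2*pi from the Frame Property."""
    return P > unaccused_faulty * pi + 2 * pi


def satisfies_frame_property(echo_times, P, pi, frames, t0=0):
    """Check one operational node's echo times over `frames` frames.

    Frame n is [t0 + n*P, t0 + (n+1)*P). The node must issue exactly one
    echo in the open window of width pi that ends each frame, and at
    most one echo anywhere in each frame.
    """
    for n in range(frames):
        start, end = t0 + n * P, t0 + (n + 1) * P
        in_frame = [t for t in echo_times if start <= t < end]
        if len(in_frame) > 1:
            return False
        in_window = [t for t in echo_times if end - pi < t < end]
        if len(in_window) != 1:
            return False
    return True
```

With P = 10 and a skew of 2, a node echoing at times 9, 19, and 29 
satisfies the property over three frames, while a node echoing twice in 
the first frame does not. 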

The next property ensures that enough of the monitored nodes that 
have not been accused are non-faulty for the protocol to work. 

Definition 9 (Majority Property). Of the nodes that have not been 
accused by the reintegrator during the entire protocol, the majority are 
operational. 

4.2 Protocol Description 

The reintegration protocol comprises three modes of operation: 
preliminary diagnosis, frame synchronization, and synchronization capture. 
These modes are executed sequentially, as shown in Fig. 3. We itemize 
the global and local state variables of the modes, and then we describe 
the behavior of the protocol during each mode. 

4.2.1 State Description 

6 In this model, we include communication error in the skew. 

Figure 3: State Machine Model of the Protocol Mode Control 

State Variables The following state variables of the reintegrator 
determine the state of the reintegrator during the execution of the 
protocol. In the following, let i range over the indices of the nodes 
the reintegrator monitors. 

• accs is an array of boolean values such that for each node i, accs[i] is 
true if the reintegrator accuses node i of being faulty and it is false 
otherwise. The reintegrator ignores echos from nodes it has accused. 

• clock is the current time of the reintegrator's local clock. 

• fs_finish ranges over the nonnegative reals and is a timer variable 
used in the frame synchronization mode. 

• mode records the current mode being executed. It ranges over the 
set {prelim_diag, frame_synch, synch_capture}, denoting the three 
modes, respectively. 

• pd_finish ranges over the nonnegative reals and denotes the time at 
which the preliminary diagnosis mode completes. 

• seen is an array of natural numbers such that for each node i, seen[i] 
records the number of times a message has been received from i. 

State Initialization The following state variables are initialized at 
the beginning of the reintegration protocol. 

for each i, accs[i] := false; 
mode := prelim_diag; 
for each i, seen[i] := 0; 


4.2.2 Protocol Behavior 

Preliminary Diagnosis When the reintegrator begins executing the 
reintegration protocol, it has no diagnostic data to use in deciding which 
nodes are faulty and which are not. Trusting too many faulty nodes 
may lower the probability that it will successfully reintegrate with the 
operational clique. The purpose of preliminary diagnosis is to acquire 
preliminary diagnostic data to attempt to recognize faulty nodes early 
in the protocol. This is achieved by monitoring echo messages for the 
duration P + $\pi$. The reintegrator expects to receive at least one and 
no more than two echo messages from each node i. 

In the following pseudo code, a when statement is a guarded action. 
The guard echo(i) is true precisely when the reintegrator receives an echo 
message from node i. 






pd_finish := clock + P + π; 
while clock < pd_finish do { 
    for each i, when echo(i) do { 
        if (seen[i] < 2 and not accs[i]) 
            then seen[i] := seen[i] + 1 
            else accs[i] := true; 
    }; 
}; 
for each i, if seen[i] = 0 then accs[i] := true; 
mode := frame_synch; 
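
The pseudocode above can be exercised on a recorded list of echo events. 
The following Python sketch is an illustrative transcription, not the 
SAL model: representing echoes as (time, node) pairs, and the function 
and parameter names, are assumptions made here. 

```python
def preliminary_diagnosis(echoes, nodes, start, P, pi):
    """Run the preliminary diagnosis mode over a list of (time, node) echoes.

    Returns the accusation map `accs` after the mode ends at start + P + pi.
    """
    accs = {i: False for i in nodes}
    seen = {i: 0 for i in nodes}
    pd_finish = start + P + pi
    for time, i in sorted(echoes):
        if not (start <= time < pd_finish):
            continue                      # outside the monitoring window
        if seen[i] < 2 and not accs[i]:
            seen[i] += 1
        else:
            accs[i] = True                # a third echo, or already accused
    for i in nodes:
        if seen[i] == 0:
            accs[i] = True                # silent nodes are accused
    return accs
```

A node that echoes once or twice in the window stays trusted, a node that 
echoes three times is accused, and a node that stays silent is accused 
when the mode exits. 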


Frame Synchronization The purpose of the frame synchronization 
mode is to determine a time such that all operational nodes have already 
issued an echo message in some frame and before any operational node 
issues an echo in the next frame. An interval satisfying this property is 
referred to as a frame gap. This provides the reintegrator with a coarse- 
grained synchronization with the operational clique: a reintegrator is able 
to separate echo messages from operational nodes arriving in different 
resynchronization frames. 

The mode relies on echo messages from operational nodes being separated 
by no more than $\pi$ units of time. Therefore, the mode begins 
monitoring for echos, and it exits when $\pi$ units of time have elapsed 
such that no echo is observed from a node that has not yet been accused. 
If an echo is observed within that time from a node that has not been 
accused, then the timer is reset. 

Acquiring this coarse-grained level of synchronization is a precondition 
for the actual resynchronization that occurs in the next mode. 

for each i, seen[i] := 0; 
fs_finish := clock; 
while clock - fs_finish < π do { 
    for each i, when echo(i) do { 
        if (seen[i] = 0 and not accs[i]) 
            then { 
                fs_finish := clock; 
                seen[i] := seen[i] + 1; 
            } 
            else accs[i] := true; 
    }; 
}; 
mode := synch_capture; 
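
As with the previous mode, the frame synchronization pseudocode can be 
sketched in Python over a list of (time, node) echo events. The event 
representation and all names are assumptions made for illustration; the 
sketch returns the time at which the mode exits, that is, the first 
instant at which $\pi$ quiet time units have elapsed. 

```python
def frame_synchronization(echoes, accs, start, pi):
    """Return (exit_time, accs) for the frame synchronization mode.

    `echoes` is a list of (time, node) pairs; `accs` maps each monitored
    node to True if it has already been accused.
    """
    accs = dict(accs)
    seen = {i: 0 for i in accs}
    fs_finish = start
    for time, i in sorted(echoes):
        if time < start:
            continue
        if time - fs_finish >= pi:
            break                         # pi quiet time units have elapsed
        if seen[i] == 0 and not accs[i]:
            fs_finish = time              # restart the quiet-period timer
            seen[i] += 1
        else:
            accs[i] = True                # repeated echo, or already accused
    return fs_finish + pi, accs
```

In the example used to test this sketch, three trusted nodes echo at 
times 1, 2, and 3 with a skew of 2, so the mode exits at time 5; a fourth 
node that echoes twice during the mode is accused. 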


Synchronization Capture The synchronization capture mode is the 
final mode of the reintegration protocol. Its purpose is to allow the reinte- 
grator to determine a time during which some operational node issues an 
echo message. It does so by synchronizing when it has received echos from 
at least half of the nodes it has not accused (or has not already seen in 
this mode). To ensure that it is synchronizing with an operational node, 
the Majority Property (Def. 9) must hold. If so, the reintegrator will have 
become resynchronized with the operational clique, within the accepted 
skew, $\pi$. 

Let trusted be the total number of nodes the reintegrator has not 
accused: trusted = |{i | not accs[i]}|. Let seen_cnt be the number of 
nodes seen (that have not been accused in previous frames): 
seen_cnt = |{i | seen[i] > 0}|. 

for each i, seen[i] := 0; 
while seen_cnt < trusted/2 do { 
    for each i, when echo(i) do { 
        if (seen[i] = 0 and not accs[i]) 
            then seen[i] := seen[i] + 1; 
    }; 
}; 
clock := 0; 
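
The synchronization capture pseudocode admits a similar Python sketch. 
It follows the pseudocode's loop condition (seen_cnt < trusted/2) and 
returns the time at which the reintegrator would reset its clock. The 
(time, node) event list and the names used are illustrative assumptions. 

```python
def synchronization_capture(echoes, accs, start):
    """Return the time at which the reintegrator resets its clock, or None.

    `echoes` is a list of (time, node) pairs; `accs` maps each monitored
    node to True if it has been accused in an earlier mode.
    """
    trusted = sum(1 for accused in accs.values() if not accused)
    seen = {i: 0 for i in accs}
    seen_cnt = 0
    if seen_cnt >= trusted / 2:
        return start                      # no trusted nodes to wait for
    for time, i in sorted(echoes):
        if time < start:
            continue
        if seen[i] == 0 and not accs[i]:
            seen[i] += 1
            seen_cnt += 1
        if seen_cnt >= trusted / 2:
            return time                   # clock := 0 at this instant
    return None                           # not enough echoes in the trace
```

With three trusted nodes, the sketch synchronizes on the second distinct 
trusted echo; echoes from accused nodes are ignored. 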

5 Modeling the Protocol in SAL 

We now describe the modeling of the reintegration protocol as a STA with 
clockless semantics. We describe the model in the language of SAL. The 
shallow embedding of the semantics of the reintegration protocol’s STA 
model in SAL is similar to our effort to do the same for the TGC’s STA 
model, as described in Sec. 3.2 and Ex. 1. The full model can be found 
both in Appendix C and on-line at [Pik05] for download. 

5.1 Timeouts 

The model contains the following timeout variables: reint_to, which is 
primarily associated with the reintegrator; frame_to, which is primarily 
associated with the operational nodes; and each faulty node has its own 
timeout. The timeouts for the operational and faulty nodes essentially 
exist for modeling purposes. In modeling the reintegrator’s execution 
of the protocol, we require a model of the entire system’s behavior. A 
naive model would fix the behavior of the monitored nodes over multi- 
ple resynchronization frames a priori. However, the state space required 
to do so makes this infeasible. Rather, we model the behavior of the 
monitored nodes one frame at a time. The frame the reintegrator is 
presently in is modeled, and if the reintegrator passes into another 
frame by updating reint_to, then the monitored nodes simultaneously 
change to the same frame (of course, the reintegrator is modeled to have 
no knowledge of which frame it is actually in). 

This model allows for a few simplifications. The behavior of the 
reintegration protocol depends on that of the observed nodes, but not vice 
versa. Thus, the model can be constructed so that reint_to is always the 
least of the timeouts (this is provable by k-induction). This ensures 
that the issuing of an echo message is always a future event observable 
by the timeout model of the reintegrator. 

P_update: MODULE = 
  TRANSITION 
  [ 
      frame_to <= reint_to' 
    --> 
      frame_to' = frame_to + P; 
      new_frame' = TRUE 
  [] 
      ELSE --> 
      new_frame' = FALSE 
  ] 

Figure 4: Synchronization Frame Module 


5.2 Monitored Nodes 

To verify the correctness of the protocol, we must model both the rein- 
tegrator and the monitored nodes. In the model, we distinguish between 
nodes in the operational clique and faulty nodes (as discussed in Sec. 4, 
non-faulty nodes not in the operational clique are considered faulty by the 
reintegrator, and their behaviors are subsumed by the modeled behavior 
of the faulty nodes). We describe the model of the two kinds of nodes in 
turn. 

5.2.1 Operational Nodes 

To model the operational nodes, we begin by defining a module that keeps 
track of the resynchronization frames, as presented in Fig. 4. The timeout 
frame_to serves as an abstract global clock shared by the synchronized 
operational nodes. The timeout keeps track of the values of $t_n$ marking 
the end of a frame, as described in Sec. 4.1. There is a single transition, 
updating the timeout frame_to on transitions when the timeout reint_to 
has been updated so that its value is in the next resynchronization frame. 
This can be determined by comparing the next state's value of reint_to 
(denoted by reint_to') to the end of the current resynchronization frame. 
The variable new_frame is a boolean value that is true if and only if the 
transition just taken was one in which the frame has been updated. 
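
The two guarded commands of this module can be mirrored by a small 
Python function. This is only an illustrative sketch: the function name 
and argument names are assumptions, and the ELSE branch relies on the 
reading that a variable not assigned in a SAL guarded command keeps its 
value. 

```python
def p_update(frame_to, reint_to_next, P):
    """Mirror the two guarded commands of P_update.

    Returns the pair (frame_to', new_frame').
    """
    if frame_to <= reint_to_next:
        return frame_to + P, True     # reintegrator entered the next frame
    return frame_to, False            # ELSE branch: frame_to is unchanged
```

For instance, with frame_to = 10 and P = 10, updating reint_to to 10 
advances frame_to to 20, whereas updating it to 7 leaves frame_to at 10. 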

The operational nodes themselves are specified by an op_node module, 
parameterized by the indices of operational nodes, presented in Fig. 5. 
The timeout for an operational node is frame_to. Whenever the frame 
updates, the node nondeterministically updates its echo variable, op_echo 
(ranging over the nonnegative reals), to a new value satisfying the 
Frame Property (Def. 8). This is a conservative model insofar as an 
operational node may update its echo to any time satisfying the 
constraints, so the difference between the echos it issues in adjoining 
frames may be up to P + $\pi$. In reality, the clock of an operational 
node would not drift so violently. 

op_node[i: OP_IDS]: MODULE = 
  TRANSITION 
  [ 
      frame_to <= reint_to' 
    --> 
      op_echo' IN {t: TIME | frame_to' > t 
                             AND t > frame_to' - pi} 
  [] 
      ELSE --> 
  ] 

Figure 5: Operational Node Module 

op_nodes: MODULE = 
  WITH OUTPUT op_echos: OP_ECHOS 
    (|| (i: OP_IDS): RENAME op_echo TO op_echos[i] 
                     IN op_node[i]); 

clique: MODULE = op_nodes || P_update; 

Figure 6: Operational Clique Module 

To ensure the correctness of the model, when the reintegrator moves 
from one frame to the next, its timeout reint_to' must never be updated 
so far into the future that it is beyond the time at which operational 
nodes issue echos in the next frame. An invariant is proved about the 
model that demonstrates that this does not occur. 

Finally, the instances of op_node are synchronously composed, and 
this composition is synchronously composed with the P_update module, 
as shown in Fig. 6. 

5.2.2 Faulty Nodes 

Faulty nodes are also specified by a module parameterized by the 
indices of nodes that may exhibit faulty behavior. The model is slightly 
more complicated so that all possible faulty behaviors are modeled, yet 
k-induction proofs are feasible over the transition system. In a naive 
model of the entire system, the reintegrator would make a transition 
whenever it receives an echo from a node it is actively monitoring. This 
would amount to updating its timeout to be equal to the timeout of the 
first echo it receives and updating its state accordingly. It would then 
reset its timeout to the next echo and so on. In this model, the 
reintegrator's transitions are event-triggered; they depend on echo 
events. However, because a faulty node may issue multiple echos before 
being ignored by the reintegrator, this model can quickly lead the 
reintegrator to make a large number of transitions for even a relatively 
small number of faulty nodes. For k-induction to succeed, a more 
sophisticated model is required. 

Figure 7: The Reintegrator TA Misses Echo Messages 

A preferable model is one in which the reintegrator’s transitions are 
essentially time-triggered. This amounts to the reintegrator updating its 
timeout irrespective of the states and timeouts of the monitored nodes. 
Ideally, a time-triggered model of the reintegrator would make a small 
number of time-triggered transitions at regular intervals and update its 
state based on all of the echos received during the intervals rather than 
updating its state upon receiving each echo. 

Care must be taken to make a time-triggered model conservative. Be- 
cause timeouts record when future events occur, when the reintegrator 
makes a state transition, it can only “observe” those echos that come af- 
ter its current timeout and no later than the time at which it sets its next 
timeout. For example, in a naive model, suppose the reintegrator were 
to update its state in a time-triggered fashion as illustrated in Fig. 7. 
Suppose reint_to denotes the reintegrator’s current timeout, which is 
also the least of all timeouts. Suppose that some monitored node issues 
an echo message at time echo. The reintegrator observes this echo 
message and updates its timeout to reint_to'. Once the current time 
reaches echo, however, that node could issue another echo message at 
some later time echo' that precedes reint_to', which will go undetected 
by the model of the reintegrator. 
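
The observation window of a time-triggered transition can be stated as a 
short Python sketch. The function name and the list representation of 
echo times are assumptions for illustration; the window itself follows 
the text: echos after the current timeout and no later than the next 
timeout. 

```python
def observed_echoes(echo_times, reint_to, reint_to_next):
    """Echo times visible to one time-triggered transition.

    Only echoes strictly after the current timeout and no later than the
    next timeout are observed.
    """
    return [t for t in echo_times if reint_to < t <= reint_to_next]
```

An echo that a node schedules only after the transition has been taken, 
but that still falls inside this window, is exactly the kind of event a 
naive time-triggered model would miss. 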

Therefore, we allow the reintegrator to behave in a time-triggered fash- 
ion (in part), but faulty nodes are able to issue multiple echo messages 
in a single transition. The model of a faulty node contains a state vari- 
able bad_echo, as shown in Fig. 8, that is an array of echos (nonnegative 
reals). The array has three indices. This is because the greatest number 
of echos that must be observed from any node in a mode is three before 
the node is accused. The echo in the first index also serves as the timeout 
for a faulty node, and the remaining echos in the array are guaranteed to 
always be greater than the timeout by the ascending? predicate. If the 
reintegrator updates its timeout in a time-triggered manner, there is the 
possibility it will observe all three echos during the update. 

Nevertheless, there is no upper bound on how large any of the values 
in the array may be. If the echos are too far ahead of the reintegrator's 
timeout, they will be beyond the time to which it updates its timeout in 
a time-triggered transition and will never be observed. Thus, the module 
also models faulty nodes that are fail-silent. As well, note that the 
behavior of a faulty node so modeled may also be indistinguishable from 
that of an operational one. A faulty node may issue echos such that the 
first echo in the array consistently satisfies the Frame Property 
(Def. 8), and the other echos are beyond the observation window of the 
reintegrator. 

bad_node[i: BAD_IDS]: MODULE = 
  TRANSITION 
  [ 
      bad_echo[1] <= reint_to' 
    --> 
      bad_echo' IN {be: BAD_ECHO_ARRAY 
                    | ascending?(be, reint_to')} 
  [] 
      ELSE --> 
  ] 

Figure 8: Faulty Node Module 

The precondition for the transition to update the echos is similar to 
that described for the synchronization frame module in Sec. 5.2.1. 
Here, if the next state's value of reint_to ever surpasses the first echo 
from a faulty node, all of the faulty node's echos are updated. Thus, all 
of the faulty node's echos are always observable by the reintegrator 
(i.e., the value of each echo is greater than reint_to). This is also 
provable in the model by k-induction. 

5.3 The Reintegrator 

Each of the three modes of the reintegration protocol is specified as a 
separate module. Additionally, another module handles mode control. 

5.3.1 Mode Control 

Each of the three modes has a binary control signal to determine whether 
it is active. Only one mode may be active at any time. The module spec- 
ified in Fig. 9 ensures the correct flow of control through the modes. It 
is synchronously composed with the modules specifying the modes them- 
selves. 

We specify mode control for a number of reasons. Making mode control 
explicit simplifies the analysis of counterexamples generated by SAL 
when attempting to verify properties of the formal model; knowing in 
which mode the counterexample occurs simplifies the search for the 
error. The mode control is part of the protocol as it was designed. Mode 
exit points demarcate locations in the execution of the protocol where 
certain invariants are supposed to be reached. An invariant guaranteed 
upon the completion of a mode serves as an assumption in demonstrating 
that the succeeding mode behaves correctly. Demonstrating that each mode 
guarantees the appropriate invariants is sufficient to demonstrate that 
the entire protocol behaves correctly. Thus, modes serve as both a 
conceptual and formal decomposition to model and verify the protocol. 
Because the module is synchronously composed with the mode modules, it 
does not affect the trajectory length required for k-induction proofs. 

modes: MODULE = 
  TRANSITION 
  [ 
      mode = pd_mode 
    --> 
      mode' = IF pd_cntrl = active 
              THEN mode ELSE fs_mode 
              ENDIF 
  [] 
      mode = fs_mode 
    --> 
      mode' = IF fs_cntrl = active 
              THEN mode ELSE sc_mode 
              ENDIF 
  [] 
      ELSE --> 
  ] 

Figure 9: Mode Control Module 

5.3.2 Base Modes 

Because of the distinct way in which operational and faulty nodes are 
modeled, it is simpler to specify distinct accs and seen variables for each 
kind. For example, in the SAL model, the reintegrator contains variables 
op_accs and bad_accs to record accusations. Nevertheless, care is taken 
to ensure that the reintegrator has no a priori knowledge about which 
nodes are in fact operational and which are faulty. 

In addition, in proving invariants, we found it simpler to specify 
separate seen variables for each mode rather than resetting the seen 
variable at the conclusion of each mode. 

The following paragraphs overview the models of each mode’s execu- 
tion. 

Preliminary Diagnosis In the preliminary diagnosis mode, there are 
two principal transitions, as shown in Fig. 10. The first transition 
models the behavior during the mode, and the second models the exiting of 
the mode. Our model of the reintegrator during the preliminary diagnosis 
mode is essentially time-triggered. The variable pd_finish marks the 
time at which the mode exits. The effect of a transition is to move the 
reintegrator's timeout from the beginning of frame n to the beginning 
of frame n + 1. As it does so, it records the echos observed in that 
frame and updates its state variables recording how many echos are seen 
from each node and whether they should be accused, respectively. When 
the reintegrator's timeout is updated to the beginning of the next frame, 
the P_update module simultaneously updates frame_to to prepare the 
reintegrator to observe the echos in the next frame (Sec. 5.2.1). If the 
mode should exit before the termination of the frame, the reintegrator's 
timeout is updated to the time at which the mode should end, and only 
those echos in the interim are recorded by the reintegrator. 

preliminary_diagnosis_mode: MODULE = 
  TRANSITION 
  [ 
      mode' = pd_mode 
      AND frame_to < pd_finish 
    --> 
      reint_to' = frame_to; 
  [] 
      mode' = pd_mode 
      AND frame_to >= pd_finish 
    --> 
      pd_cntrl' = deactive; 
      reint_to' = pd_finish; 
  ] 

Figure 10: Preliminary Diagnosis Module 

Frame Synchronization The purpose of the frame synchronization 
mode is to allow the reintegrator to discover some time during which no 
echos have been observed for $\pi$ units of time (from nodes it does not 
know to be faulty). Thus, as shown in Fig. 11, we define the relation 
none_in_pi? that determines whether or not this holds. If the relation is 
not satisfied, the reintegrator's timeout is updated to the greatest echo 
not known to be from a faulty node within $\pi$ units of time of the 
reintegrator's current timeout within the current frame. If no such echo 
exists within the current frame, reint_to is updated to the beginning of 
the next frame, allowing the operational echos to be updated (see 
Sec. 5.2.1). When the relation does hold, the reintegrator's timeout is 
simply updated to be $\pi$ units of time greater than its current value. 

frame_synchronization_mode: MODULE = 
  TRANSITION 
  [ 
      mode' = fs_mode 
      AND NOT none_in_pi?(reint_to, op_echos, bad_echos, 
                          fs_op_seen, fs_bad_seen, 
                          op_accs, bad_accs) 
    --> 
      fs_cntrl' = active; 
      reint_to' IN {t: TIME 
                    | last_in_pi?(t, reint_to, 
                                  op_echos, bad_echos, 
                                  op_accs, bad_accs, 
                                  fs_op_seen, 
                                  fs_bad_seen, 
                                  reint_to)}; 
  [] 
      mode' = fs_mode 
      AND none_in_pi?(reint_to, op_echos, bad_echos, 
                      op_accs, bad_accs, 
                      fs_op_seen, fs_bad_seen) 
    --> 
      fs_cntrl' = deactive; 
      reint_to' = reint_to + pi; 
  ] 

Figure 11: Frame Synchronization Module 

Synchronization Capture In the last mode, shown in Fig. 12, we 
allow the reintegrator to behave in an event-triggered fashion. The reinte- 
grator’s timeout is updated from its current value to the time at which the 
soonest echo message occurs (that does not come from a node known to 
be faulty), or if no such echo exists in the current frame, it updates to the 
beginning of the next frame. The function sc_seen_total records how 
many echo messages have been seen so far. The mode exits when more 
than half of the nodes that have not been accused have been observed - 

synch_capture_mode : MODULE =
  TRANSITION
  [
      mode' = sc_mode
      AND sc_seen_total(sc_op_seen, sc_bad_seen)
          <= not_accd(op_accs, bad_accs)/2
    -->
      sc_cntrl' = active;
      reint_to' IN {t : TIME
                    | next?(t, reint_to,
                            op_echos, bad_echos,
                            op_accs, bad_accs,
                            sc_op_seen,
                            sc_bad_seen, frame_to)};
  []
      mode' = sc_mode
      AND sc_seen_total(sc_op_seen, sc_bad_seen)
          > not_accd(op_accs, bad_accs)/2
    -->
      sc_cntrl' = deactive;
  ]


Figure 12: Synchronization Capture Module 


that is, when 

sc_seen_total (sc_op_seen, sc_bad_seen) 

> not_accd(op_accs , bad_accs)/2. 

This also marks the termination of the reintegration protocol. 
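
The exit condition of this mode is a strict-majority test over the nodes that have not been accused. The following sketch mirrors the roles of `not_accd` and `sc_seen_total`, but it represents the accusation and seen arrays as plain Python sequences of booleans; the function `sc_exits` and the argument order are assumptions made here for illustration.

```python
from typing import Sequence


def not_accd(op_accs: Sequence[bool], bad_accs: Sequence[bool]) -> int:
    """Number of monitored nodes the reintegrator has not accused."""
    return sum(1 for accused in list(op_accs) + list(bad_accs) if not accused)


def sc_seen_total(sc_op_seen: Sequence[bool],
                  sc_bad_seen: Sequence[bool]) -> int:
    """Number of nodes whose echo has been seen in synchronization capture."""
    return sum(1 for seen in list(sc_op_seen) + list(sc_bad_seen) if seen)


def sc_exits(sc_op_seen: Sequence[bool], sc_bad_seen: Sequence[bool],
             op_accs: Sequence[bool], bad_accs: Sequence[bool]) -> bool:
    """True when more than half of the non-accused nodes have been seen."""
    return sc_seen_total(sc_op_seen, sc_bad_seen) > not_accd(op_accs, bad_accs) / 2
```

With two operational nodes and one faulty node, none accused, the mode stays active after one echo (1 is not greater than 1.5) and exits after two.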

5.3.3 Composing The Modules 

The three mode modules are composed asynchronously, in the base_modes 
module: 

base_modes : MODULE =
    preliminary_diagnosis_mode
  []
    frame_synchronization_mode
  []
    synch_capture_mode;

No two modes should be active simultaneously. This is enforced by 
ensuring that if one mode is active, the others are deadlocked. The 
reintegrator is then defined as the synchronous composition of the base_modes 
module and the modes module: 

reintegrator: MODULE = base_modes || modes; 


The entire system is the synchronous composition of the reintegrator 
module, the clique module, and the module of the composition of the 
faulty nodes: 

system: MODULE = reintegrator || clique || bad_nodes; 


6 Verifying the Protocol 

There are two main theorems to prove. First, we wish to show that 
the reintegrator accuses no operational nodes during the execution of the 
reintegration protocol. Second, we wish to show that the reintegrator has 
successfully resynchronized with the operational nodes upon completion 
of the reintegration protocol. 

Theorem 1 (No Operational Accusations). For all operational nodes 
i, accs[i] does not hold during the reintegration protocol. 

Theorem 2 (Synchronization Acquisition). For all operational nodes 
i, $|\mathit{clock} - \mathit{echo}(i)| < \pi$ upon termination of the reintegration protocol. 

The proofs of these theorems via k-induction require a number of 
supporting lemmas. As mentioned in Sect. 5.3.1, our general strategy is 
to decompose the protocol verification into a verification of its constituent 
modes. Each mode should guarantee certain postconditions. The post- 
conditions for a mode then serve as preconditions for succeeding modes. 
This strategy can be followed through the entire protocol making the proof 
of the above theorems straightforward. 

This proof strategy is similar to the proof by abstraction technique 
used in [DS04b, DS04a] and inspired by abstraction techniques described 
in [Rus00, MP94]. However, in [DS04b, DS04a], the abstraction was 
manually constructed after the fact to aid in the verification. Its 
construction seemed to require a great deal of understanding about the protocol 
before verifying it. Furthermore, it is a predicate abstraction: a state 
machine is constructed, the states of which are predicates over the pro- 
tocol model. These predicates roughly correspond to the postconditions 
we verify. While the predicate abstraction technique is more powerful, its 
construction is more complicated; the mode abstraction is simply adopted 
from the protocol specification. 

In this respect, a proof by k-induction can be seen to fall somewhere 
between an inductive invariant approach and a clock function approach 
used in verifying total correctness of transformational programs via me- 
chanical theorem-proving [RM04]. The former approach amounts to in- 
duction over a transition system, while the latter requires one to show 
that from any state satisfying the precondition, the program halts and 
the postcondition is satisfied after some fixed number of transitions from 
that state. 

The proof of Thm 1 requires showing that no accusations are issued 
in any of the modes; accusations are not issued in the synchronization 
capture mode, so we need be concerned with only the first two modes. 
One challenge in doing so is that the reintegrator is unsynchronized with 
the operational nodes in these modes, so it may begin listening for echos 
at any time during a frame. In particular, it may begin listening for 
echos after some operational nodes have issued them and before others 
have done so. Thus, for example, in the preliminary diagnosis mode, we 
cannot state precisely how many echo messages the reintegrator should 
receive from an operational node. Rather, the reintegrator should receive at 
least one and no more than two echos. Proving that this in fact happens 
requires some additional lemmas regarding the maximum and minimum 
length of time the mode is active, and the effects of the mode initializing 
at different points in a frame. 

The proof of Thm 2 relies principally on two supporting lemmas, each 
of which provides a precondition for the synchronization capture mode. The 
first precondition is 
that no operational nodes have been accused (Thm 1). The second is that 
the time at which the synchronization capture mode initializes and the 
reintegrator begins listening for echos is such that either all operational 
nodes in that frame have already issued echo messages or no operational 
node in the frame has issued one; that is, the frame synchronization mode 
has successfully located a frame gap. 

Architectures Verified In the prototypical design of SPIDER, the 
reintegrator monitors no more than three nodes. The architecture of the 
SPIDER bus is a bipartite graph of six nodes (i.e., there are two disjoint 
sets of nodes, and any two nodes from distinct sets have interconnects and 
no two nodes from the same set have interconnects) [MMT02], and this 
architecture with six nodes is designed to tolerate up to two simultaneous 
Byzantine faults. 

The protocol has been verified for up to four monitored nodes, where 
one node may be faulty (without increasing the number of non-faulty 
nodes, a greater number of faulty nodes would violate Def. 9, the 
Majority Property). The proofs took on the order of seconds (and occa- 
sionally minutes) to complete on a machine with a gigabyte of memory. 
Although we did not attempt it, it may be possible to verify these proper- 
ties for models containing a greater number of monitored nodes if proofs 
are allowed to run on the order of hours or on a more powerful machine. 
Furthermore, strengthening the invariants would allow larger architectures 
to be verified. 

Invariant k-Induction Proofs Because of the way in which we have 
modeled the protocol, for most lemmas, the size of k required to prove 
a lemma is invariant to the number of monitored nodes modeled. The 
size of k is dependent upon the duration of a mode (i.e., for how many 
resynchronization frames it is active) rather than on how many echos are 
received in the mode. For the architectures verified, all lemmas are proved 
by k-induction for k < 4. 
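
To make the role of k concrete, the following is a minimal explicit-state sketch of k-induction on a finite toy system. It is only an illustration: SAL performs the corresponding checks symbolically with bounded model checking and decision procedures, whereas this sketch enumerates paths, and the toy transition system and all names in it are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

State = int


def paths(starts: Iterable[State], succ: Dict[State, List[State]],
          length: int) -> List[Tuple[State, ...]]:
    """All paths with `length` transitions that begin in `starts`."""
    frontier = [(s,) for s in starts]
    for _ in range(length):
        frontier = [p + (t,) for p in frontier for t in succ[p[-1]]]
    return frontier


def k_induction(states: Iterable[State], init: Iterable[State],
                succ: Dict[State, List[State]],
                prop: Callable[[State], bool], k: int) -> bool:
    """Return True if `prop` is proved invariant by k-induction."""
    states, init = list(states), list(init)
    # Base case: prop holds in every state reachable in fewer than k steps.
    for j in range(k):
        if any(not prop(p[-1]) for p in paths(init, succ, j)):
            return False
    # Induction step: k consecutive states satisfying prop, starting from
    # ANY state (reachable or not), must be followed by a state satisfying it.
    for p in paths(states, succ, k):
        if all(prop(s) for s in p[:-1]) and not prop(p[-1]):
            return False
    return True


# Toy system: 0 -> 1 -> 2 -> 0 is reachable; 3 -> 4 -> 4 is unreachable.
SUCC = {0: [1], 1: [2], 2: [0], 3: [4], 4: [4]}


def safe(s: State) -> bool:
    return s != 4
```

Here `safe` is not provable with k = 1, because the unreachable state 3 satisfies it but steps to 4. With k = 2 the spurious counterexample disappears, since no state leads into 3. The number of enumerated paths grows with k, which is the cost that motivates keeping k small.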


Lemmas From Failed Proof Attempts SAL has the capacity to 
assist the user in discovering required lemmas. It has an option such 
that when enabled, SAL will return a counterexample to a proof by 
k-induction. Because the model is infinite, the counterexample is often 
symbolic. It shows a k-trajectory over which the constraints of the 
infinitely-typed variables do not satisfy the induction step (rarely does the base case 
fail). The onus is upon the user to interpret how the constraints lead to 
a counterexample. 

Clique Avoidance Clique avoidance is the property that there exists 
exactly one operational clique in the system [Rus01, BP00]. If more than 
one clique exists, the nodes in one clique will consider the nodes in the 
other to be either faulty or recovering, and the members of each clique 
disregard the nodes in the other. This decreases the survivability of the 
system, since each clique is smaller than it would be if all the nodes were 
in the same clique. Worse though is that multiple cliques may lead the 
processors connected to the bus architecture to lose agreement about 
the status of the bus. The bus interface unit serving as the interface 
between a processor and the other nodes in the bus architecture can only 
communicate within the clique in which it is a member. Consequently, 
multiple cliques can violate processor-level fault-tolerance requirements 
the bus is supposed to guarantee for the attached processors. 

The SPIDER architecture does not have a protocol to guarantee clique 
avoidance [TPMM05], unlike, e.g., TTP/C [BP00, Pfe03]. However, the 
architecture is designed with the intent that if during the course of its 
operation the MFA (Sec. 1) is not violated, clique avoidance is guaran- 
teed. The analysis of the reintegration protocol supports this claim by 
demonstrating that a necessary condition for clique avoidance is met. 

Suppose the MFA is not violated and the protocols executed by the 
non-faulty nodes during startup and normal operation guarantee clique 
avoidance. Then the only opportunity for clique avoidance to be violated 
is after multiple nodes suffer transient faults and attempt to find a clique 
with which to reintegrate. If we assume clique avoidance holds while a 
node begins to reintegrate, it has only one clique to observe. By Thm. 1, 
such a node will not accuse the nodes in the single clique during reintegra- 
tion and will therefore reintegrate into it by executing the reintegration 
protocol. 

The two assumptions to the above argument are essential. First, it 
is necessary to assume the MFA is not violated. If the architecture suf- 
fers a massive failure that triggers a bus restart, scenarios exist in which 
clique avoidance is violated, although these scenarios have a low 
probability [TPMM05]. Second, it is necessary to assume that the protocols 
executed during startup and normal operation guarantee clique avoidance. 
Although the essential protocols that execute during startup and normal 
operation have been formally verified individually [MGPM04], there does not 
yet exist a cohesive argument to demonstrate formally that clique avoidance 
is preserved. 

7 Conclusion 


We have described a formal proof of the correctness of the SPIDER 
Reintegration Protocol in the SAL tool using k-induction. We have described 
improvements to a novel formalism for real-time systems that has recently 
been proposed and successfully used now in two industrial-scale verifica- 
tion projects (this and the work presented in [DS04a]). Furthermore, 
we have described a means by which event-triggered behavior can be 
modeled as time-triggered behavior. This application demonstrates that 
both appropriate formalisms and appropriate abstractions of the physical 
world [PMMG04] are necessary if non-trivial problems are to be addressed 
by formal methods. The essential means by which we achieved our results 
was by introducing synchrony into the formalism and by (conservatively) 
modeling event-triggered actions with time-triggered behavior. 

Modeling the reintegration protocol revealed two distinctions between 
this protocol and the other fault-tolerant protocols designed for SPIDER 
and similar systems. First, although the ROBUS is designed to withstand 
Byzantine faults, this sort of fault does not warrant special consideration 
when reasoning about reintegration. A node that suffers a Byzantine fault 
can send arbitrary messages to other nodes. The difficulty in designing 
distributed protocols to tolerate Byzantine faults is that different nodes 
may receive different messages from the same node. Reintegration is not 
a distributed protocol; only the reintegrator executes a reintegration pro- 
tocol, so only the messages the reintegrator receives are relevant when 
reasoning about the correctness of the protocol. 

Second, the topology of the system does not need to be modeled. The 
verification is with respect to a single node, the reintegrator. All that is of 
concern are what messages the reintegrator receives from the other nodes 
in the system. If a communication link does not exist that allows a node 
to send messages to the reintegrator, then that node is simply ignored in 
the formal model. 

The formal specification and verification of the reintegration protocol 
did not reveal any flaws in the protocol. Nevertheless, it was of value since 
no hand proofs existed to demonstrate its correctness. Furthermore, the 
protocol was significantly different from the other SPIDER protocols and 
many other well-studied fault-tolerant distributed protocols [Lyn96]. As 
well, the formal verification not only demonstrated the correctness of rein- 
tegration but it also strongly suggests that clique avoidance is preserved. 

Nevertheless, the formal verification did reveal that a more general 
assumption can be used to prove the correctness of this mode: we require 
only that $P > l\pi + 2\pi$, where $P$ is the duration of a resynchronization 
frame, $\pi$ is the skew, and $l$ is the number of faulty nodes not accused 
by the reintegrator during the first two modes. In the originally-stated 
assumption, the requirement was that $P > m\pi + \pi$, where $m$ is the total 
number of monitored nodes. This latter requirement implies the former 
(since Def. 9, the Majority Property, ensures there is at least one oper- 
ational node). In the worst case (i.e., if the reintegrator trusts as many 
faulty nodes as possible for the protocol to work), they are equivalent. 
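
The implication between the two frame-length assumptions can be spelled out in one step. In the sketch below, $o$ and $f$ are names introduced here (they do not appear in the original text) for the numbers of operational and faulty monitored nodes, so that $m = o + f$. Def. 9 gives $o \ge 1$, and $l \le f$ because $l$ counts only those faulty nodes that were not accused; the skew satisfies $\pi > 0$.

```latex
\begin{align*}
m = o + f \;\ge\; 1 + l
  \quad&\Longrightarrow\quad
  m\pi + \pi \;\ge\; (l + 1)\pi + \pi \;=\; l\pi + 2\pi, \\
P > m\pi + \pi
  \quad&\Longrightarrow\quad
  P > l\pi + 2\pi .
\end{align*}
```

The two bounds coincide exactly when $m = l + 1$.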

As called for in [DS04a], future work includes theoretical studies 
comparing STA and other real-time formalisms. Other techniques for 
optimizing STA for k-induction would be useful; in particular, techniques for 
k-induction over parameterized systems would be of much practical value. 
Also of value would be direct comparisons between the specification and 
verification of real-time systems in SAL and in other tools specifically 
designed for real-time system verification (e.g., HyTech [HHWT97], Kro- 
nos [DOTY95], Uppaal [LPY97], etc.). 

We note too that we have concerned ourselves with only timeout au- 
tomata here. In [DS04a, DS04b], Dutertre and Sorea develop a more 
complex real-time formalism they call calendar automata in which time- 
outs are also associated with the delay between when a message is sent and 
received by communicating processes. We did not model communication 
delays in our model of the reintegration protocol. The relationship be- 
tween timeout automata, synchronizing timeout automata, and calendar 
automata should be examined in more detail. 

A difficulty with k-induction is that properties can be proved vacuously 
if the system is deadlocked. Checking for deadlocks in infinite-state 
systems is a difficult problem. This is exacerbated by the fact that SAL's 
language is typed, and violating typing constraints can cause deadlocks as 
well. The heuristic used to check for deadlocks was to specify properties 
we knew should be false and attempt to prove them by k-induction. This 
is only a positive test for deadlock; a counterexample does not imply the 
system is not deadlocked. An alternative strategy that might be fruitful 
would be to construct some finite abstraction of a transition system with 
the property that if the finite abstraction contains a deadlock, then so 
does the infinite-state system. SAL contains a deadlock checker for finite 
state systems that could be then applied. This approach was not pursued 
in this work, however. 

A STA Model of the TGC in SAL 


%
% Lee Pike
% NASA Langley Formal Methods Group
% lee.s.pike@nasa.gov
%
% SAL 2.3
%
% Adapted from the TGC model in SAL by B. Dutertre and
% M. Sorea, in "Timed Systems in SAL," Technical Report
% SRI-SDL-04-03, July 2004.
%

sta_tgc : CONTEXT =
BEGIN

  SIGNAL        : TYPE = {approach, exit, lower, raise, null};
  TIME          : TYPE = REAL;
  N             : NATURAL = 3;
  INDEX         : TYPE = [1..N];
  TIMEOUT_ARRAY : TYPE = ARRAY INDEX OF TIME;

  T_STATE : TYPE = {t0, t1, t2, t3};
  G_STATE : TYPE = {g0, g1, g2, g3};
  C_STATE : TYPE = {c0, c1, c2, c3};

  to_min(t1: TIME, t2: TIME, t3: TIME): TIME =
    min(t1, min(t2, t3));


  %
  % Clock module: makes time elapse up to the next timeout
  %
  clock: MODULE =
  BEGIN
    INPUT
      t_timeout : TIME,
      c_timeout : TIME,
      g_timeout : TIME
    OUTPUT time: TIME
    INITIALIZATION time = 0
    TRANSITION
    [ time_elapses :
        time < to_min(t_timeout, c_timeout, g_timeout)
      -->
        time' = to_min(t_timeout, c_timeout, g_timeout)
    ]
  END;


  %
  % Train module
  %
  train: MODULE =
  BEGIN
    INPUT
      time    : TIME,
      c_state : C_STATE
    OUTPUT
      t_timeout : TIME,
      msg1      : SIGNAL
    LOCAL
      reset   : TIME,
      t_state : T_STATE
    INITIALIZATION
      t_state = t0;
      msg1 = null;
    TRANSITION
    [ t0_t1 :
        t_state = t0
        AND t_timeout = time
        AND c_state = c0
      -->
        t_state' = t1;
        msg1' = approach;
        reset' = time + 5;
        t_timeout' IN { x: TIME | time + 2 < x
                                  AND x <= time + 5}
    [] t1_t2 :
        t_state = t1
        AND t_timeout = time
      -->
        msg1' = null;
        t_state' = t2;
        t_timeout' IN { x: TIME | time < x
                                  AND x <= reset}
    [] t2_t3 :
        t_state = t2
        AND t_timeout = time
      -->
        msg1' = null;
        t_state' = t3;
        t_timeout' IN { x: TIME | time < x
                                  AND x <= reset}
    [] t3_t0:
        t_state = t3
        AND t_timeout = time
        AND c_state = c2
      -->
        t_state' = t0;
        msg1' = exit;
        t_timeout' IN { x: TIME | time < x}
    []
      ELSE -->
    ]
  END;

  %
  % GATE module
  %
  gate: MODULE =
  BEGIN
    INPUT
      time      : TIME,
      c_timeout : TIME,
      msg2      : SIGNAL
    OUTPUT
      g_timeout : TIME,
      g_state   : G_STATE
    INITIALIZATION
      g_state = g0;
    TRANSITION
    [ g0_g1:
        g_state = g0
        AND msg2' = lower
        AND c_timeout = time
      -->
        g_state' = g1;
        g_timeout' IN { x: TIME | time < x
                                  AND x <= time + 1}
    [] g1_g2:
        g_state = g1
        AND g_timeout = time
      -->
        g_state' = g2;
        g_timeout' IN { x: TIME | time < x }
    [] g2_g3:
        g_state = g2
        AND msg2' = raise
        AND c_timeout = time
      -->
        g_state' = g3;
        g_timeout' IN { x: TIME | time + 1 <= x
                                  AND x <= time + 2}
    [] g3_g0 :
        g_state = g3
        AND g_timeout = time
      -->
        g_state' = g0;
        g_timeout' IN { x: TIME | time < x}
    []
      ELSE -->
    ]
  END;


  %
  % Controller module
  %
  controller : MODULE =
  BEGIN
    INPUT
      time      : TIME,
      t_timeout : TIME,
      msg1      : SIGNAL,
      g_state   : G_STATE
    OUTPUT
      c_timeout : TIME,
      msg2      : SIGNAL,
      c_state   : C_STATE
    INITIALIZATION
      c_state = c0;
      msg2 = null;
    TRANSITION
    [ c0_c1:
        c_state = c0
        AND t_timeout = time
        AND msg1' = approach
      -->
        c_state' = c1;
        c_timeout' = time + 1
    [] c1_c2:
        c_state = c1
        AND c_timeout = time
        AND g_state = g0
      -->
        c_state' = c2;
        msg2' = lower;
        c_timeout' IN { x: TIME | time < x }
    [] c2_c3:
        c_state = c2
        AND msg1' = exit
        AND t_timeout = time
      -->
        c_state' = c3;
        c_timeout' IN { x: TIME | time < x
                                  AND x <= time + 1}
    [] c3_c0:
        c_state = c3
        AND c_timeout = time
        AND g_state = g2
      -->
        c_state' = c0;
        msg2' = raise;
        c_timeout' IN { x: TIME | time < x}
    []
      ELSE -->
    ]
  END;

  tgc : MODULE = train || gate || controller;
  system: MODULE = clock [] tgc;

  %
  % properties
  %
  % proved d9
  safe: LEMMA system |- G(t_state = t2 => g_state = g2);

  %
  % liveness checks
  %
  tstate2: LEMMA system |- G(t_state /= t2);
  gstate2: LEMMA system |- G(g_state /= g2);
  cstate2: LEMMA system |- G(c_state /= c2);
  tstate3: LEMMA system |- G(t_state /= t3);
  gstate3: LEMMA system |- G(g_state /= g3);
  cstate3: LEMMA system |- G(c_state /= c3);

END

B STA Model of the TGC with Clockless Semantics in SAL 


%
% Lee Pike
% NASA Langley Formal Methods Group
% lee.s.pike@nasa.gov
%
% SAL 2.3
%
% Adapted from the TGC model in SAL by B. Dutertre and
% M. Sorea, in "Timed Systems in SAL," Technical Report
% SRI-SDL-04-03, July 2004.
%

sta_tgc_clockless : CONTEXT =
BEGIN

  SIGNAL        : TYPE = {approach, exit, lower, raise, null};
  TIME          : TYPE = REAL;
  N             : NATURAL = 3;
  INDEX         : TYPE = [1..N];
  TIMEOUT_ARRAY : TYPE = ARRAY INDEX OF TIME;

  T_STATE : TYPE = {t0, t1, t2, t3};
  G_STATE : TYPE = {g0, g1, g2, g3};
  C_STATE : TYPE = {c0, c1, c2, c3};

  to_min(t1: TIME, t2: TIME, t3: TIME): TIME =
    min(t1, min(t2, t3));

  %
  % Train module
  %
  train: MODULE =
  BEGIN
    INPUT
      c_timeout : TIME,
      g_timeout : TIME,
      c_state   : C_STATE
    OUTPUT
      t_timeout : TIME,
      msg1      : SIGNAL
    LOCAL
      reset   : TIME,
      t_state : T_STATE
    INITIALIZATION
      t_state = t0;
      msg1 = null;
    TRANSITION
    [ t0_t1 :
        t_state = t0
        AND t_timeout = to_min(t_timeout, c_timeout, g_timeout)
        AND c_state = c0
      -->
        t_state' = t1;
        msg1' = approach;
        reset' = t_timeout + 5;
        t_timeout' IN { x: TIME | t_timeout + 2 < x
                                  AND x <= t_timeout + 5}
    [] t1_t2 :
        t_state = t1
        AND t_timeout = to_min(t_timeout, c_timeout, g_timeout)
      -->
        msg1' = null;
        t_state' = t2;
        t_timeout' IN { x: TIME | t_timeout < x
                                  AND x <= reset}
    [] t2_t3 :
        t_state = t2
        AND t_timeout = to_min(t_timeout, c_timeout, g_timeout)
      -->
        msg1' = null;
        t_state' = t3;
        t_timeout' IN { x: TIME | t_timeout < x
                                  AND x <= reset}
    [] t3_t0:
        t_state = t3
        AND t_timeout = to_min(t_timeout, c_timeout, g_timeout)
        AND c_state = c2
      -->
        t_state' = t0;
        msg1' = exit;
        t_timeout' IN { x: TIME | t_timeout < x}
    []
      ELSE -->
    ]
  END;


  %
  % GATE module
  %
  gate: MODULE =
  BEGIN
    INPUT
      c_timeout : TIME,
      t_timeout : TIME,
      msg2      : SIGNAL
    OUTPUT
      g_timeout : TIME,
      g_state   : G_STATE
    INITIALIZATION
      g_state = g0;
    TRANSITION
    [ g0_g1:
        g_state = g0
        AND msg2' = lower
        AND c_timeout = to_min(t_timeout, c_timeout, g_timeout)
      -->
        g_state' = g1;
        g_timeout' IN { x: TIME | c_timeout < x
                                  AND x <= c_timeout + 1}
    [] g1_g2:
        g_state = g1
        AND g_timeout = to_min(t_timeout, c_timeout, g_timeout)
      -->
        g_state' = g2;
        g_timeout' IN { x: TIME | g_timeout < x }
    [] g2_g3:
        g_state = g2
        AND msg2' = raise
        AND c_timeout = to_min(t_timeout, c_timeout, g_timeout)
      -->
        g_state' = g3;
        g_timeout' IN { x: TIME | c_timeout + 1 <= x
                                  AND x <= c_timeout + 2}
    [] g3_g0 :
        g_state = g3
        AND g_timeout = to_min(t_timeout, c_timeout, g_timeout)
      -->
        g_state' = g0;
        g_timeout' IN { x: TIME | g_timeout < x}
    []
      ELSE -->
    ]
  END;


  %
  % Controller module
  %
  controller : MODULE =
  BEGIN
    INPUT
      t_timeout : TIME,
      g_timeout : TIME,
      msg1      : SIGNAL,
      g_state   : G_STATE
    OUTPUT
      c_timeout : TIME,
      msg2      : SIGNAL,
      c_state   : C_STATE
    INITIALIZATION
      c_state = c0;
      msg2 = raise;
    TRANSITION
    [ c0_c1 :
        c_state = c0
        AND t_timeout = to_min(t_timeout, c_timeout, g_timeout)
        AND msg1' = approach
      -->
        c_state' = c1;
        c_timeout' = t_timeout + 1
    [] c1_c2:
        c_state = c1
        AND c_timeout = to_min(t_timeout, c_timeout, g_timeout)
        AND g_state = g0
      -->
        c_state' = c2;
        msg2' = lower;
        c_timeout' IN { x: TIME | c_timeout < x }
    [] c2_c3:
        c_state = c2
        AND msg1' = exit
        AND t_timeout = to_min(t_timeout, c_timeout, g_timeout)
      -->
        c_state' = c3;
        c_timeout' IN { x: TIME | t_timeout < x
                                  AND x <= t_timeout + 1}
    [] c3_c0:
        c_state = c3
        AND c_timeout = to_min(t_timeout, c_timeout, g_timeout)
        AND g_state = g2
      -->
        c_state' = c0;
        msg2' = raise;
        c_timeout' IN { x: TIME | c_timeout < x}
    []
      ELSE -->
    ]
  END;


  system: MODULE = train || gate || controller;

  %
  % properties
  %
  % proved d5
  safe: LEMMA system |- G(t_state = t2 => g_state = g2);

  %
  % liveness checks
  %

tstate2: LEMMA system |- G(t_state /= t2) ; 

gstate2: LEMMA system |- G(g_state /= g2) ; 

cstate2: LEMMA system |- G(c_state /= c2) ; 

tstate3: LEMMA system |- G(t_state /= t3) ; 

gstate3: LEMMA system |- G(g_state /= g3) ; 

cstate3: LEMMA system |- G(c_state /= c3) ; 

END 

C STA Model of the Reintegration Protocol in SAL 


%
% Lee Pike
% NASA Langley Formal Methods Group
% lee.s.pike@nasa.gov
%
% Compatible with SAL 2.3 & SAL 2.4
%

reint: CONTEXT =
BEGIN

% TYPES AND CONSTANTS

% The nonnegative reals.
TIME : TYPE = {x: REAL | 0 <= x};

MODES : TYPE = {pd_mode, fs_mode, sc_mode};

CNTRL : TYPE = {active, deactive};


%
% IF THE NUMBER OF NODES ARE CHANGED, UPDATE THESE TYPES AND
% CONSTANTS APPROPRIATELY.

% The user must ensure that there are enough operational
% nodes to ensure the majority property always holds
% (this requires the majority of nodes to be operational).

% Number of abstract operational nodes.
op_total : NATURAL = 2;

% Number of abstract bad nodes.
bad_total : NATURAL = 1;

% Total number of abstract nodes.
total : NATURAL = op_total + bad_total;

% The set of all abstract nodes.
ALL_IDS : TYPE = {x: [1..total] | x=1 OR x=2 OR x=3};

ALL_CNT : TYPE = {x: [0..total] | x=0 OR x=1
                                  OR x=2 OR x=3};

% The set of abstract operational nodes.
OP_IDS : TYPE = {x: [1..op_total] | x=1 OR x=2};

% Where the indexing of bad nodes starts.
b_st : NATURAL = op_total + 1;

% The set of abstract bad nodes.
BAD_IDS : TYPE = {x: [b_st..total] | x=3};

%

% The size of the seen variable in preliminary diagnosis.
pd_see_top : NATURAL = 3;

% The set of possible seen vals in preliminary diagnosis.
PD_SEEN : TYPE = {x: [0..pd_see_top]
                  | x=0 OR x=1 OR x=2
                    OR x=pd_see_top};

% The set of times at which abstract operational nodes send
% echos.
OP_ECHOS : TYPE = ARRAY OP_IDS OF TIME;

% Each bad node has an array of times at which it echos.
% The array is as big as the number of times that echos
% can be seen in a mode.
BAD_ECHO_ARRAY : TYPE = ARRAY [1..pd_see_top] OF TIME;

% The set of echo arrays for the abstract bad nodes.
BAD_ECHOS : TYPE = ARRAY BAD_IDS OF BAD_ECHO_ARRAY;

% The reintegrator's set of accusations of operational nodes.
OP_ACCS : TYPE = ARRAY OP_IDS OF BOOLEAN;

% The reintegrator's set of accusations of bad nodes.
BAD_ACCS : TYPE = ARRAY BAD_IDS OF BOOLEAN;

% The reintegrator's record of how many times an operational
% node has been seen in preliminary diagnosis.
PD_OP_SEEN : TYPE = ARRAY OP_IDS OF PD_SEEN;

% The reintegrator's record of how many times a bad node
% has been seen in preliminary diagnosis.
PD_BAD_SEEN : TYPE = ARRAY BAD_IDS OF PD_SEEN;

% maximum skew between operational nodes
pi : {t : TIME | 0 < t};

% Length of synchronization frame. Constrained by skew.
P : {t : TIME | t > pi * (bad_total + 2)};

%

% FUNCTIONS

% Is the node id of an operational node?
operational?(i : ALL_IDS) : BOOLEAN = i <= op_total;

%
% Functions for monitoring echos from faulty nodes in the
% preliminary diagnosis mode.
pd_bad_see_rec(bad_ecs : BAD_ECHO_ARRAY, ind: PD_SEEN,
               start: TIME, ending: TIME, seen: PD_SEEN)
  : PD_SEEN =
  IF ind=0 THEN seen
  ELSIF (bad_ecs[ind] > start AND bad_ecs[ind] <= ending)
  THEN pd_bad_see_rec(bad_ecs, ind-1, start, ending, seen+1)
  ELSE pd_bad_see_rec(bad_ecs, ind-1, start, ending, seen) ENDIF;

pd_bad_see(seen: PD_SEEN, bad_ecs: BAD_ECHO_ARRAY,
           start: TIME, ending: TIME): PD_SEEN =
  IF pd_bad_see_rec(bad_ecs, pd_see_top, start, ending, 0)
     + seen
     >= pd_see_top
  THEN pd_see_top
  ELSE pd_bad_see_rec(bad_ecs, pd_see_top, start, ending, 0)
       + seen
  ENDIF;

pd_bad_echos(seen: PD_SEEN, bad_ec : BAD_ECHO_ARRAY,
             start: TIME, ending: TIME): PD_SEEN =
  pd_bad_see(seen, bad_ec, start, ending);

%

% Functions for monitoring echos from operational nodes in the
% preliminary diagnosis mode.
pd_op_echos(seen: PD_SEEN, op_ec: TIME,
            start: TIME, ending: TIME): PD_SEEN =
  IF (op_ec > start AND op_ec <= ending AND seen < pd_see_top)
  THEN seen+1 ELSE seen ENDIF;


%
% Functions for finding the last valid echo in the
% frame synchronization mode transitions.

% No echos from eligible nodes within pi ticks.
none_in_pi?(reint_to : TIME, op_echos: OP_ECHOS,
            bad_echos: BAD_ECHOS, fs_op_seen: OP_ACCS,
            fs_bad_seen: BAD_ACCS, op_accs: OP_ACCS,
            bad_accs: BAD_ACCS) : BOOLEAN =
  (FORALL (i: OP_IDS) :
     (    op_echos[i] > reint_to
      AND (NOT op_accs[i]) AND (NOT fs_op_seen[i]))
     => op_echos[i] > pi+reint_to)
  AND (FORALL (i: BAD_IDS) :
     ((NOT bad_accs[i]) AND (NOT fs_bad_seen[i]))
     => bad_echos[i][1] > pi+reint_to);

% Defined as a predicate rather than a recursive function;
% see Dutertre and Sorea's tech report for an explanation.

% True at the time of the last eligible echo in pi ticks.
last_in_pi?(t : TIME, reint_to: TIME, op_echos: OP_ECHOS,
            bad_echos: BAD_ECHOS, fs_op_seen: OP_ACCS,
            fs_bad_seen: BAD_ACCS, op_accs: OP_ACCS,
            bad_accs: BAD_ACCS, frame_to: TIME): BOOLEAN =
  (FORALL (i: OP_IDS) :
     (    op_echos[i] > reint_to
      AND op_echos[i] <= pi+reint_to
      AND (NOT op_accs[i]) AND (NOT fs_op_seen[i]))
     => t >= op_echos[i])
  AND (FORALL (i: BAD_IDS) :
     (    bad_echos[i][1] <= pi+reint_to
      AND (NOT bad_accs[i]) AND (NOT fs_bad_seen[i]))
     => t >= bad_echos[i][1])
  AND (   (EXISTS (i: OP_IDS) :
             (NOT op_accs[i]) AND (NOT fs_op_seen[i])
             AND op_echos[i] <= pi+reint_to
             AND op_echos[i] < frame_to
             AND t = op_echos[i])
       OR (EXISTS (j : BAD_IDS) :
             (NOT bad_accs[j]) AND (NOT fs_bad_seen[j])
             AND bad_echos[j][1] <= pi+reint_to
             AND bad_echos[j][1] < frame_to
             AND t = bad_echos[j][1])
       OR (frame_to < reint_to+pi AND t = frame_to));

%
% Functions for counting the number of accused
% for the synchronization mode.
not_accd_rec(op_accs : OP_ACCS, bad_accs : BAD_ACCS,
             i: ALL_CNT, cnt : ALL_CNT) : ALL_CNT =
  IF i=0 THEN cnt
  ELSE (IF operational?(i)
        THEN
          (IF op_accs[i]
           THEN not_accd_rec(op_accs, bad_accs, i-1, cnt)
           ELSE not_accd_rec(op_accs, bad_accs, i-1, cnt+1)
           ENDIF)
        ELSE
          (IF bad_accs[i]
           THEN not_accd_rec(op_accs, bad_accs, i-1, cnt)
           ELSE not_accd_rec(op_accs, bad_accs, i-1, cnt+1)
           ENDIF)
        ENDIF)
  ENDIF;

not_accd(op_accs : OP_ACCS, bad_accs: BAD_ACCS) : ALL_CNT =
  not_accd_rec(op_accs, bad_accs, total, 0);

%
% How many nodes seen in synch capture mode.
sc_seen_rec(sc_op_seen: OP_ACCS, sc_bad_seen: BAD_ACCS,
            i: ALL_CNT, cnt: ALL_CNT) : ALL_CNT =
  IF i = 0 THEN cnt
  ELSE
    (IF operational?(i)
     THEN
       (IF sc_op_seen[i]
        THEN sc_seen_rec(sc_op_seen, sc_bad_seen, i-1, cnt+1)
        ELSE sc_seen_rec(sc_op_seen, sc_bad_seen, i-1, cnt)
        ENDIF)
     ELSE
       (IF sc_bad_seen[i]
        THEN sc_seen_rec(sc_op_seen, sc_bad_seen, i-1, cnt+1)
        ELSE sc_seen_rec(sc_op_seen, sc_bad_seen, i-1, cnt)
        ENDIF)
     ENDIF)
  ENDIF;

sc_seen_total(sc_op_seen : OP_ACCS,
              sc_bad_seen: BAD_ACCS) : ALL_CNT =
  sc_seen_rec(sc_op_seen, sc_bad_seen, total, 0);

% Determining the next valid echo for the synchronization
% mode transitions.
% Defined as a predicate rather than a recursive function;
% see Dutertre and Sorea's tech report for an explanation.

next?(t: TIME, reint_to: TIME, op_echos: OP_ECHOS,
      bad_echos: BAD_ECHOS, fs_op_seen: OP_ACCS,
      fs_bad_seen: BAD_ACCS, op_accs: OP_ACCS,
      bad_accs: BAD_ACCS, frame_to: TIME): BOOLEAN =
      (FORALL (i: OP_IDS):
         (    op_echos[i] > reint_to
          AND (NOT op_accs[i]) AND (NOT fs_op_seen[i]))
         => t <= op_echos[i])
  AND (FORALL (i: BAD_IDS):
         (NOT bad_accs[i]) AND (NOT fs_bad_seen[i])
         => t <= bad_echos[i][1])
  AND (   (EXISTS (i: OP_IDS):
               (NOT op_accs[i]) AND (NOT fs_op_seen[i])
           AND reint_to < op_echos[i]
           AND op_echos[i] < frame_to
           AND t = op_echos[i])
       OR (EXISTS (j: BAD_IDS):
               (NOT bad_accs[j]) AND (NOT fs_bad_seen[j])
           AND bad_echos[j][1] < frame_to
           AND t = bad_echos[j][1])
       OR t = frame_to);

% Ensures the echos in the array from a faulty node
% are ascending.

ascending?(be: BAD_ECHO_ARRAY, reint_to: TIME): BOOLEAN =
  FORALL (e: [1..pd_see_top-1]):     be[e] > reint_to
                                 AND be[e] < be[e+1];

%---------------------------------------------------------------

% MODEL

% MODES

modes: MODULE =
BEGIN
  INPUT
    pd_cntrl: CNTRL,
    fs_cntrl: CNTRL,
    sc_cntrl: CNTRL
  OUTPUT
    mode: MODES
  INITIALIZATION
    mode = pd_mode
  TRANSITION
  [
    mode = pd_mode
    -->
      mode' = IF pd_cntrl=active THEN mode ELSE fs_mode
              ENDIF
  []
    mode = fs_mode
    -->
      mode' = IF fs_cntrl=active THEN mode ELSE sc_mode
              ENDIF
  []
    ELSE -->
  ]
END;


% PRELIMINARY DIAGNOSIS MODE

preliminary_diagnosis_mode: MODULE =
BEGIN
  INPUT
    mode:        MODES,
    frame_to:    TIME,
    op_echos:    OP_ECHOS,
    bad_echos:   BAD_ECHOS
  OUTPUT
    pd_cntrl:    CNTRL
  LOCAL
    pd_finish:   TIME,
    pd_op_seen:  PD_OP_SEEN,
    pd_bad_seen: PD_BAD_SEEN
  GLOBAL
    op_accs:     OP_ACCS,
    bad_accs:    BAD_ACCS,
    reint_to:    TIME
  INITIALIZATION
    pd_cntrl = active;
    op_accs = [[i: OP_IDS] FALSE];
    bad_accs = [[i: BAD_IDS] FALSE];
    pd_op_seen = [[i: OP_IDS] 0];
    pd_bad_seen = [[i: BAD_IDS] 0];
    reint_to IN {t: TIME | t >= 0 AND t < P};
    pd_finish = reint_to+P+pi
  TRANSITION

  [
    mode' = pd_mode
    AND frame_to < pd_finish
    -->
      reint_to' = frame_to;
      pd_op_seen' = [[i: OP_IDS]
                       IF (NOT op_accs[i])
                       THEN pd_op_echos(pd_op_seen[i],
                                        op_echos[i],
                                        reint_to, reint_to')
                       ELSE pd_op_seen[i] ENDIF];
      pd_bad_seen' = [[i: BAD_IDS]
                        IF (NOT bad_accs[i])
                        THEN pd_bad_echos(pd_bad_seen[i],
                                          bad_echos[i],
                                          reint_to,
                                          reint_to')
                        ELSE pd_bad_seen[i] ENDIF];
      op_accs' = [[i: OP_IDS]
                    op_accs[i]
                    OR pd_op_seen'[i] = pd_see_top];
      bad_accs' = [[i: BAD_IDS]
                     bad_accs[i]
                     OR pd_bad_seen'[i] = pd_see_top]
  []
    mode' = pd_mode
    AND frame_to >= pd_finish
    -->
      pd_cntrl' = deactive;
      reint_to' = pd_finish;
      pd_op_seen' = [[i: OP_IDS]
                       IF (NOT op_accs[i])
                       THEN pd_op_echos(pd_op_seen[i],
                                        op_echos[i],
                                        reint_to, reint_to')
                       ELSE pd_op_seen[i] ENDIF];
      pd_bad_seen' = [[i: BAD_IDS]
                        IF (NOT bad_accs[i])
                        THEN pd_bad_echos(pd_bad_seen[i],
                                          bad_echos[i],
                                          reint_to, reint_to')
                        ELSE pd_bad_seen[i] ENDIF];
      op_accs' = [[i: OP_IDS]
                    op_accs[i]
                    OR pd_op_seen'[i] = pd_see_top
                    OR pd_op_seen'[i] = 0];
      bad_accs' = [[i: BAD_IDS]
                     bad_accs[i]
                     OR pd_bad_seen'[i] = pd_see_top
                     OR pd_bad_seen'[i] = 0]
  ]
END;

%---------------------------------------------------------------

% FRAME SYNCHRONIZATION

frame_synchronization_mode: MODULE =
BEGIN
  INPUT
    mode:        MODES,
    op_echos:    OP_ECHOS,
    bad_echos:   BAD_ECHOS,
    frame_to:    TIME
  OUTPUT
    fs_cntrl:    CNTRL
  LOCAL
    fs_op_seen:  OP_ACCS,
    fs_bad_seen: BAD_ACCS
  GLOBAL
    op_accs:     OP_ACCS,
    bad_accs:    BAD_ACCS,
    reint_to:    TIME
  INITIALIZATION
    fs_cntrl = deactive;
    fs_op_seen = [[i: OP_IDS] FALSE];
    fs_bad_seen = [[i: BAD_IDS] FALSE]

  TRANSITION
  [
    mode' = fs_mode
    AND NOT none_in_pi?(reint_to, op_echos, bad_echos,
                        fs_op_seen, fs_bad_seen,
                        op_accs, bad_accs)
    -->
      fs_cntrl' = active;
      reint_to' IN {t: TIME
                    | last_in_pi?(t, reint_to,
                                  op_echos, bad_echos,
                                  op_accs, bad_accs,
                                  fs_op_seen,
                                  fs_bad_seen, frame_to)};
      fs_op_seen' = [[i: OP_IDS]
                       fs_op_seen[i]
                       OR (    op_echos[i] > reint_to
                           AND op_echos[i] <= reint_to')];
      fs_bad_seen' = [[i: BAD_IDS]
                        fs_bad_seen[i]
                        OR bad_echos[i][1] <= reint_to'];
      op_accs' = [[i: OP_IDS]
                    op_accs[i]
                    OR (    op_echos[i] > reint_to
                        AND op_echos[i] <= reint_to'
                        AND fs_op_seen[i])];
      bad_accs' = [[i: BAD_IDS]
                     bad_accs[i]
                     OR (    bad_echos[i][1] <= reint_to'
                         AND fs_bad_seen[i])];
  []
    mode' = fs_mode
    AND none_in_pi?(reint_to, op_echos, bad_echos,
                    op_accs, bad_accs,
                    fs_op_seen, fs_bad_seen)
    -->
      fs_cntrl' = deactive;
      reint_to' = reint_to+pi;
      op_accs' = [[i: OP_IDS]
                    op_accs[i]
                    OR (    op_echos[i] > reint_to
                        AND op_echos[i] <= reint_to'
                        AND fs_op_seen[i])];
      bad_accs' = [[i: BAD_IDS]
                     bad_accs[i]
                     OR (    bad_echos[i][1] <= reint_to'
                         AND fs_bad_seen[i])];
  ]
END;

%---------------------------------------------------------------

% SYNCHRONIZATION CAPTURE MODE

synch_capture_mode: MODULE =
BEGIN
  INPUT
    mode:        MODES,
    frame_to:    TIME,
    op_echos:    OP_ECHOS,
    bad_echos:   BAD_ECHOS
  OUTPUT
    sc_cntrl:    CNTRL
  LOCAL
    sc_op_seen:  OP_ACCS,
    sc_bad_seen: BAD_ACCS
  GLOBAL
    reint_to:    TIME,
    op_accs:     OP_ACCS,
    bad_accs:    BAD_ACCS
  INITIALIZATION
    sc_cntrl = deactive;
    sc_op_seen = [[i: OP_IDS] FALSE];
    sc_bad_seen = [[i: BAD_IDS] FALSE]
  TRANSITION

  [
    mode' = sc_mode
    AND sc_seen_total(sc_op_seen, sc_bad_seen)
        <= not_accd(op_accs, bad_accs)/2
    -->
      sc_cntrl' = active;
      reint_to' IN {t: TIME
                    | next?(t, reint_to,
                            op_echos, bad_echos,
                            op_accs, bad_accs, sc_op_seen,
                            sc_bad_seen, frame_to)};
      sc_op_seen' = [[i: OP_IDS]
                       sc_op_seen[i]
                       OR op_echos[i] = reint_to'];
      sc_bad_seen' = [[i: BAD_IDS]
                        sc_bad_seen[i]
                        OR bad_echos[i][1] = reint_to'];
  []
    mode' = sc_mode
    AND sc_seen_total(sc_op_seen, sc_bad_seen)
        > not_accd(op_accs, bad_accs)/2
    -->
      sc_cntrl' = deactive
  ]
END;

%---------------------------------------------------------------

% CLIQUE

op_node[i: OP_IDS]: MODULE =
BEGIN
  INPUT
    frame_to: TIME,
    reint_to: TIME
  OUTPUT
    op_echo: TIME
  INITIALIZATION
    op_echo IN {t: TIME | frame_to > t AND t > frame_to-pi}
  TRANSITION
  [
    frame_to <= reint_to'
    -->
      op_echo' IN {t: TIME
                   |     frame_to' > t
                     AND t > frame_to'-pi}
  []
    ELSE -->
  ]
END;

P_update: MODULE =
BEGIN
  INPUT
    reint_to: TIME
  LOCAL
    new_frame: BOOLEAN
  OUTPUT
    frame_to: TIME
  INITIALIZATION
    frame_to = IF reint_to >= pi THEN P+pi ELSE pi
               ENDIF;
    new_frame = FALSE
  TRANSITION
  [
    frame_to <= reint_to'
    -->
      frame_to' = frame_to+P;
      new_frame' = TRUE
  []
    ELSE -->
      new_frame' = FALSE
  ]
END;

op_nodes: MODULE =
  WITH OUTPUT op_echos: OP_ECHOS
    (|| (i: OP_IDS): RENAME op_echo TO op_echos[i]
                     IN op_node[i]);

clique: MODULE = op_nodes || P_update;


% BAD NODES

bad_node[i: BAD_IDS]: MODULE =
BEGIN
  INPUT
    reint_to: TIME
  OUTPUT
    bad_echo: BAD_ECHO_ARRAY
  INITIALIZATION
    bad_echo IN {be: BAD_ECHO_ARRAY
                 | ascending?(be, reint_to)}
  TRANSITION
  [
    bad_echo[1] <= reint_to'
    -->
      bad_echo' IN {be: BAD_ECHO_ARRAY
                    | ascending?(be, reint_to')}
  []
    ELSE -->
  ]
END;

bad_nodes: MODULE =
  WITH OUTPUT bad_echos: BAD_ECHOS
    (|| (i: BAD_IDS): RENAME bad_echo TO bad_echos[i]
                      IN bad_node[i]);

%---------------------------------------------------------------

% FULL SYSTEM

base_modes: MODULE =
     preliminary_diagnosis_mode
  [] frame_synchronization_mode
  [] synch_capture_mode;

reintegrator: MODULE = base_modes || modes;

system: MODULE = reintegrator || clique || bad_nodes;

%---------------------------------------------------------------

% CONJECTURES

% Depths for a model with 2-3 operational nodes and
% one faulty node.

% In the comments above a conjecture, "dn", where n is a
% natural number, is the k-induction depth.  The lemmas
% follow, preceded by "-l".

% NOTE: With a great number of nodes, it is often useful to
% disable expensive Buchi automata optimizations
% by setting the --disable-expensive-ba-opt flag.

% SYSTEM INVARIANTS

% If a mode is enabled, then the other modes are deactive.

% proved d1
mode_cntrl: LEMMA
  system |- G(    (mode = pd_mode
                   => (    fs_cntrl = deactive
                       AND sc_cntrl = deactive))
              AND (mode = fs_mode
                   => (    pd_cntrl = deactive
                       AND sc_cntrl = deactive))
              AND (mode = sc_mode
                   => (    pd_cntrl = deactive
                       AND fs_cntrl = deactive)));

% Echos from operational nodes satisfy the frame property.
% proved d1
frame_prop: LEMMA
  system |- G(FORALL (i: OP_IDS):
                    frame_to > op_echos[i]
                AND frame_to-op_echos[i] < pi);

%---------------------------------------------------------------

% PRELIMINARY DIAGNOSIS INVARIANTS

% Bounds the finish time for the preliminary diagnosis mode.
% proved d1
pd_finish: LEMMA system |- G(pd_finish < 2*P+pi);

% No operational nodes are accused upon initialization.
% proved d1
pd_init_op_accs: LEMMA
  system |- G(FORALL (i: OP_IDS):
                (    mode = pd_mode AND pd_op_seen[i] = 0
                 AND pd_cntrl = active)
                => NOT op_accs[i]);

% Operational nodes are seen no more than twice during the
% preliminary diagnosis mode.
% proved d4 -l pd_finish -l mode_cntrl
op_seen_less2: LEMMA
  system |- G(FORALL (i: OP_IDS): pd_op_seen[i] <= 2);

% Operational nodes are seen no less than once during the
% preliminary diagnosis mode.
% proved d3 -l mode_cntrl -l pd_init_op_accs
op_seen_more1: LEMMA
  system |- G(pd_cntrl = deactive
              => FORALL (i: OP_IDS): pd_op_seen[i] >= 1);

% No operational nodes are accused in the preliminary
% diagnosis mode.
% proved d1 -l op_seen_more1 -l op_seen_less2
pd_no_op_accs: LEMMA
  system |- G(mode = pd_mode
              => FORALL (i: OP_IDS): NOT op_accs[i]);

% The frame synchronization mode seen variables are
% unchanged in the preliminary diagnosis mode.
% proved d1
pd_not_fs_seen: LEMMA
  system |- G(mode = pd_mode
              => (    (FORALL (i: OP_IDS): NOT fs_op_seen[i])
                  AND (FORALL (i: BAD_IDS): NOT fs_bad_seen[i])));

% The synchronization capture mode seen variables are
% unchanged in the preliminary diagnosis mode.
% proved d1
pd_not_sc_seen: LEMMA
  system |- G(mode = pd_mode
              => (    (FORALL (i: OP_IDS): NOT sc_op_seen[i])
                  AND (FORALL (i: BAD_IDS): NOT sc_bad_seen[i])));

%---------------------------------------------------------------


% FRAME SYNCHRONIZATION INVARIANTS

% No operational nodes are accused when the frame
% synchronization mode initializes.
% proved d1 -l pd_no_op_accs
fs_init_no_op_accs: LEMMA
  system |- G(FORALL (i: OP_IDS):
                (mode = fs_mode AND NOT fs_op_seen[i])
                => NOT op_accs[i]);

% The frame gap is found at the conclusion of the
% frame synchronization mode.
% proved d3 -l pd_no_op_accs -l fs_init_no_op_accs
%           -l frame_prop -l pd_not_fs_seen
fs_frame_gap: LEMMA
  system |- G((mode = fs_mode AND fs_cntrl = deactive)
              => (FORALL (i: OP_IDS):
                    reint_to < op_echos[i]));

% Demonstrates that the reintegrator's timeout has passed
% the echo of an operational node in the current frame
% if it has seen the echo already.
% proved d3 -l mode_cntrl -l pd_not_fs_seen
%           -l fs_init_no_op_accs -l pd_no_op_accs
%           -l frame_prop -l fs_frame_gap
fs_window: LEMMA
  system |- G(fs_cntrl = active
              => (FORALL (i: OP_IDS):
                    (    reint_to > frame_to-pi
                     AND fs_op_seen[i]
                     AND NOT reint_to = op_echos[i])
                    => reint_to > op_echos[i]));

% No operational nodes are accused in the frame
% synchronization mode.
% proved d3 -l mode_cntrl -l pd_not_fs_seen
%           -l fs_init_no_op_accs -l pd_no_op_accs
%           -l frame_prop -l fs_window
fs_no_op_accs: LEMMA
  system |- G(mode = fs_mode
              => FORALL (i: OP_IDS): NOT op_accs[i]);

% The seen state variables for the synchronization capture
% mode are invariant in the frame synchronization mode.
% proved d1 -l pd_not_sc_seen
fs_not_sc_seen: LEMMA
  system |- G(mode = fs_mode
              => (    (FORALL (i: OP_IDS): NOT sc_op_seen[i])
                  AND (FORALL (i: BAD_IDS): NOT sc_bad_seen[i])));

%---------------------------------------------------------------


% SYNCHRONIZATION CAPTURE INVARIANTS

% No operational nodes are accused during the reintegration
% protocol.
% proved d1 -l fs_no_op_accs -l pd_no_op_accs
no_op_accs: THEOREM
  system |- G(FORALL (i: OP_IDS): NOT op_accs[i]);

% When the synchronization protocol initializes, the
% reintegrator is in a frame gap.
% proved d1 -l mode_cntrl -l frame_prop -l no_op_accs
%           -l fs_not_sc_seen -l fs_frame_gap
sc_init_frame_gap: LEMMA
  system |- G(mode = sc_mode
              => (FORALL (i: OP_IDS):
                    NOT sc_op_seen[i]
                    => reint_to < op_echos[i]));

% The reintegration protocol synchronizes the reintegrator.
% proved d4 -l mode_cntrl -l frame_prop -l no_op_accs
%           -l fs_not_sc_seen -l sc_init_frame_gap
synched: THEOREM
  system |- G((mode = sc_mode AND sc_cntrl = deactive)
              => (FORALL (i: OP_IDS):
                    IF reint_to >= op_echos[i]
                    THEN reint_to - op_echos[i] < pi
                    ELSE op_echos[i] - reint_to < pi
                    ENDIF));

%---------------------------------------------------------------


% MODEL INVARIANTS

% To prove the timeout automata model with time-triggered
% transitions is faithful.

% The bad echo array behaves correctly.
% proved d1
bad_echos_ascend: LEMMA
  system |- G(FORALL (i: BAD_IDS):
                FORALL (e: [1..pd_see_top-1]):
                  bad_echos[i][e] < bad_echos[i][e+1]);

% The reintegrator's timeout is always less than the frame
% timeout and the echos of the faulty nodes.
% proved d2 -l mode_cntrl -l sc_init_frame_gap
%           -l fs_frame_gap -l frame_prop -l bad_echos_ascend
reint_to_least: LEMMA
  system |- G(    reint_to < frame_to
              AND (FORALL (i: BAD_IDS):
                     FORALL (e: [1..pd_see_top]):
                       reint_to < bad_echos[i][e]));

% The reintegrator's timeout is always within the current
% frame.
% proved d3 -l reint_to_least -l fs_frame_gap -l synched
current_frame: LEMMA
  system |- G(frame_to - reint_to <= P);

% Whenever the reintegrator moves to a new frame, it
% does not immediately move to the window in which
% operational nodes issue their echos.
% proved d2
good_frame_update: LEMMA
  system |- G(new_frame => reint_to <= frame_to-pi);

%---------------------------------------------------------------


% LIVENESS CHECKS

% These should be false.  A counterexample to them provides
% some assurance that the model is not deadlocked (making
% conjectures vacuously true).

% Counterexamples at
% d4 -l frame_prop -l mode_cntrl -l pd_finish
%    -l pd_not_fs_seen -l pd_not_sc_seen -l fs_frame_gap
%    -l fs_window -l fs_no_op_accs -l no_op_accs
%    -l synched -l sc_init_frame_gap -l pd_no_op_accs
%    -l current_frame -l reint_to_least

pd_ck: LEMMA system |- G(mode /= pd_mode);

fs_ck: LEMMA system |- G(mode /= fs_mode);

sc_ck: LEMMA system |- G(mode /= sc_mode);

%---------------------------------------------------------------


END 
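
The counting functions and the synchronization capture guard in the listing above can be mirrored in ordinary code for quick experimentation with concrete values. The following Python sketch is not part of the SAL model and is no substitute for the k-induction proofs. It assumes that the accusation and seen arrays are represented as dictionaries keyed by node id (a hypothetical encoding), and that the division in the guard is real-valued. The helper names capture_done and synched_ok are illustrative only.

```python
from typing import Dict, Iterable


def not_accd(op_accs: Dict[int, bool], bad_accs: Dict[int, bool]) -> int:
    """Count the nodes (operational or faulty) that are not accused."""
    flags = list(op_accs.values()) + list(bad_accs.values())
    return sum(1 for accused in flags if not accused)


def sc_seen_total(sc_op_seen: Dict[int, bool],
                  sc_bad_seen: Dict[int, bool]) -> int:
    """Count the nodes whose echo was seen in synch capture mode."""
    flags = list(sc_op_seen.values()) + list(sc_bad_seen.values())
    return sum(1 for seen in flags if seen)


def capture_done(sc_op_seen: Dict[int, bool], sc_bad_seen: Dict[int, bool],
                 op_accs: Dict[int, bool], bad_accs: Dict[int, bool]) -> bool:
    """Guard of the second synch_capture_mode transition: more than half
    of the non-accused nodes have been seen (real division assumed)."""
    return sc_seen_total(sc_op_seen, sc_bad_seen) > not_accd(op_accs, bad_accs) / 2


def synched_ok(reint_to: float, op_echos: Iterable[float], pi: float) -> bool:
    """Condition stated by the synched theorem: the reintegrator's timeout
    is within pi of every operational echo."""
    return all(abs(reint_to - echo) < pi for echo in op_echos)


# Three operational nodes, none accused; one faulty node, already accused.
op_accs = {1: False, 2: False, 3: False}
bad_accs = {4: True}

# One of three non-accused nodes seen: 1 > 1.5 fails, capture continues.
after_one = capture_done({1: True, 2: False, 3: False}, {4: False},
                         op_accs, bad_accs)
# Two of three seen: 2 > 1.5 holds, capture mode deactivates.
after_two = capture_done({1: True, 2: True, 3: False}, {4: False},
                         op_accs, bad_accs)
print(after_one, after_two)  # prints "False True"
```

The if-then-else in the synched theorem is an absolute-value bound written without an abs operator, which is why the sketch collapses it to abs(reint_to - echo) < pi.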
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